mflamand / Bullseye

Bullseye analysis pipeline for DART-seq analysis
MIT License
12 stars 4 forks source link

documentation fix: encode annotation refFlat not working (ucsc table browser, gtfToGenePred) #5

Closed ckuenne closed 2 years ago

ckuenne commented 2 years ago

hi,

i followed your description for extraction of a refFlat file from ucsc based on encode annotation (https://github.com/mflamand/Bullseye, point 2.). this lead to a crash of Find_edit_site.pl:

perl Find_edit_site.pl  --annotationFile ./ucsc_refflat/homo_sapiens_hg38.txt --EditedMatrix ko-dart.matrix.gz --controlMatrix wt-dart.matrix.gz --minEdit 5 --maxEdit 90 --editFoldThreshold 1.5 --MinEditSites 3 --cpu 16 --outfile output.bed --fallback /mnt/flatfiles/organisms/new_organism/homo_sapiens/104/homo_sapiens.104.mainChr.fa --verbose
processing using 16 cpu cores
Processing files ./ucsc_refflat/homo_sapiens_hg38.txt, wt-dart.matrix.gz, ko-dart.matrix.gz and /mnt/flatfiles/organisms/new_organism/homo_sapiens/104/homo_sapiens.104.mainChr.fa
Detecting C - to - T transitions
Minimum coverages of 10 for control matrix and 10 for edited matrix
Detecting bases with 0.05 to 0.9 edit
Miminum enrichement of editing of control is 1.5
loading annotation file ./ucsc_refflat/homo_sapiens_hg38.txt... 
Use of uninitialized value $eString in concatenation (.) or string at ./Find_edit_site.pl line 248, <$annotation_handle> line 106192.
Use of uninitialized value $eString in split at ./Find_edit_site.pl line 251, <$annotation_handle> line 106192.
...

this is due to a wrong format of the refflat file, where the first field is missing (gene name):

#name   chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts  exonEnds
ENST00000684719.1   chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,

default gtfToGenePred on encode gtfs will also not work, because it misses the same field.

workaround 1: use ucsc table browser as you described, additionally add field name2, rearrange file to have the correct order of columns (#name2 name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds)

workaround 2: use gtfToGenePred on encode gtf with parameter -genePredExt, remove unnecessary columns and move gene_id up front

gtfToGenePred -genePredExt homo_sapiens.104.mainChr.gtf homo_sapiens.104.mainChr.genepredext

ENST00000684719 chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,    0   ENSG00000203963 cmpl    cmpl    0,2,1,2,0,0,-1,-1,

# reformat ->

ENSG00000203963 ENST00000684719 chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,    
mflamand commented 2 years ago

Thanks for catching this, I have updated the instructions in the README to reflect this. I am currently out of the office, If you don't mind, we can leave this Issue up until I come back to the office next week and I can make more thorough testing. In the meantime if someone else has issues than can refer to this post.