documentation fix: encode annotation refFlat not working (ucsc table browser, gtfToGenePred)

hi,

i followed your description for extraction of a refFlat file from ucsc based on encode annotation (https://github.com/mflamand/Bullseye, point 2.). this lead to a crash of Find_edit_site.pl:

perl Find_edit_site.pl  --annotationFile ./ucsc_refflat/homo_sapiens_hg38.txt --EditedMatrix ko-dart.matrix.gz --controlMatrix wt-dart.matrix.gz --minEdit 5 --maxEdit 90 --editFoldThreshold 1.5 --MinEditSites 3 --cpu 16 --outfile output.bed --fallback /mnt/flatfiles/organisms/new_organism/homo_sapiens/104/homo_sapiens.104.mainChr.fa --verbose
processing using 16 cpu cores
Processing files ./ucsc_refflat/homo_sapiens_hg38.txt, wt-dart.matrix.gz, ko-dart.matrix.gz and /mnt/flatfiles/organisms/new_organism/homo_sapiens/104/homo_sapiens.104.mainChr.fa
Detecting C - to - T transitions
Minimum coverages of 10 for control matrix and 10 for edited matrix
Detecting bases with 0.05 to 0.9 edit
Miminum enrichement of editing of control is 1.5
loading annotation file ./ucsc_refflat/homo_sapiens_hg38.txt... 
Use of uninitialized value $eString in concatenation (.) or string at ./Find_edit_site.pl line 248, <$annotation_handle> line 106192.
Use of uninitialized value $eString in split at ./Find_edit_site.pl line 251, <$annotation_handle> line 106192.
...

this is due to a wrong format of the refflat file, where the first field is missing (gene name):

#name   chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts  exonEnds
ENST00000684719.1   chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,

default gtfToGenePred on encode gtfs will also not work, because it misses the same field.

workaround 1: use ucsc table browser as you described, additionally add field name2, rearrange file to have the correct order of columns (#name2 name chrom strand txStart txEnd cdsStart cdsEnd exonCount exonStarts exonEnds)

workaround 2: use gtfToGenePred on encode gtf with parameter -genePredExt, remove unnecessary columns and move gene_id up front

gtfToGenePred -genePredExt homo_sapiens.104.mainChr.gtf homo_sapiens.104.mainChr.genepredext

ENST00000684719 chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,    0   ENSG00000203963 cmpl    cmpl    0,2,1,2,0,0,-1,-1,

# reformat ->

ENSG00000203963 ENST00000684719 chr1    -   67092164    67134970    67093004    67127240    8   67092164,67095234,67096251,67115351,67125751,67127165,67131141,67134929,    67093604,67095421,67096321,67115464,67125909,67127257,67131227,67134970,

mflamand / Bullseye

documentation fix: encode annotation refFlat not working (ucsc table browser, gtfToGenePred) #5