nanoporetech / pinfish

Tools to annotate genomes using long read transcriptomics data

pinfish annotation for prokaryotic genomes fails to collapse reads #4

Closed felixgrunberger closed 5 years ago

felixgrunberger commented 5 years ago

I used the snakemake pinfish pipeline to annotate E. coli K12 from direct RNA-seq reads using standard parameters (only increasing sensitivity by lowering the -c parameter). After polishing and collapsing, many overlapping reads still remain. I have been experimenting with the collapse parameters, but so far nothing has really worked. Any ideas?

The genome file was downloaded from NCBI: https://www.ncbi.nlm.nih.gov/nuccore/U00096.2. GFF3: ecoli_k12.txt

GFF output from snakemake pinfish: clustered_transcripts_collapsed.txt polished_transcripts_collapsed.txt polished_transcripts.txt clustered_transcripts.txt

bsipos commented 5 years ago

Could you please also post a link to the base annotation you compared the results to?

felixgrunberger commented 5 years ago

Updated my initial comment with the base annotation file.

bsipos commented 5 years ago

I have looked at your GFF files and collapsing seems to be working as intended. When sequencing eukaryotic mRNA using direct RNA, the 3' end of the transcript can be trusted more, because a molecule without a poly(A) tail will not get sequenced. Hence the collapsing tool assigns the transcripts to loci based on the proximity of their 3' ends and then discards the transcripts which are likely to be incomplete at the 5' end. This will not necessarily discard contained transcripts whose 3' end is further away.
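To make the behaviour described above concrete, here is a minimal Python sketch of the idea. It is not pinfish's actual implementation: it assumes single-interval transcripts, and the `Transcript` tuple, the `collapse_by_three_prime` function, and the `tolerance` value are all hypothetical names and numbers chosen for illustration. It groups transcripts whose 3' ends lie close together and keeps only the longest one in each group, which shows why a contained transcript with a distant 3' end survives.

```python
from collections import namedtuple

# Hypothetical record for illustration: 1-based start/end, strand "+" or "-".
Transcript = namedtuple("Transcript", "name start end strand")


def three_prime(t):
    """Genomic coordinate of the 3' end (depends on strand)."""
    return t.end if t.strand == "+" else t.start


def span(t):
    """Length covered by the transcript on the genome."""
    return t.end - t.start


def collapse_by_three_prime(transcripts, tolerance=30):
    """Group transcripts whose 3' ends are within `tolerance` of the
    previous one (per strand) and keep the longest per group, i.e. the
    one reaching furthest towards the 5' end. `tolerance` is an
    arbitrary illustrative value, not a pinfish default."""
    kept = []
    for strand in ("+", "-"):
        same_strand = sorted(
            (t for t in transcripts if t.strand == strand), key=three_prime
        )
        locus = []
        for t in same_strand:
            # Start a new group when the 3' end jumps by more than tolerance.
            if locus and three_prime(t) - three_prime(locus[-1]) > tolerance:
                kept.append(max(locus, key=span))
                locus = []
            locus.append(t)
        if locus:
            kept.append(max(locus, key=span))
    return kept
```

With a full-length transcript, a 5'-truncated copy sharing its 3' end, and a contained transcript ending elsewhere, only the truncated copy is removed in this sketch; the contained one stays because its 3' end places it in a different group.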

Since you have prokaryotic mRNA (which I assume you polyadenylated), the 3' ends cannot be trusted as much. So in the case of your data I think it makes sense to discard contained transcripts, but not necessarily to merge all overlapping ones. Discarding the contained transcripts can be done with gffread -M.
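As a rough illustration of what "discard contained transcripts, but do not merge overlapping ones" means, here is a small Python sketch. It does not reproduce how gffread implements this; it assumes intron-less, single-interval transcripts, and the `Transcript` tuple and `drop_contained` function are hypothetical names introduced only for this example.

```python
from collections import namedtuple

# Hypothetical record for illustration: 1-based start/end, strand "+" or "-".
Transcript = namedtuple("Transcript", "name start end strand")


def drop_contained(transcripts):
    """Return transcripts that are not fully contained in another
    transcript on the same strand. Partially overlapping transcripts
    are all kept; nothing is merged."""
    # Sort so that any potential container comes before what it contains:
    # by strand, then leftmost start, then longest first.
    ordered = sorted(transcripts, key=lambda t: (t.strand, t.start, -t.end))
    kept = []
    for t in ordered:
        contained = any(
            k.strand == t.strand and k.start <= t.start and t.end <= k.end
            for k in kept
        )
        if not contained:
            kept.append(t)
    return kept
```

The quadratic scan over `kept` is fine for a sketch; for a whole genome one would normally track only the furthest end seen so far per strand, since the input is already sorted by start.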

Best, Botond

felixgrunberger commented 5 years ago

Thanks for your help! I will try to figure out which solution fits our data best and let you know.