nanoporetech / pipeline-pinfish-analysis

Pipeline for annotating genomes using long read transcriptomics data with pinfish

clustering step seems to ignore reasonable clusters #6

Closed by nhartwic 5 years ago

nhartwic commented 5 years ago

I'm currently experimenting with the pinfish pipeline and seeing numerous apparent errors in the output. Reads that I would naively expect to cluster together and form a transcript don't.

The link below is an IGV screenshot showing one such apparent error. The top section shows the initial BAM file generated by the pinfish pipeline. The next section is a reference annotation file created by a third party (using MAKER, I believe, but don't quote me). The remaining sections are from files output directly by the pipeline.

https://cdn.discordapp.com/attachments/200893092788699136/606974675737641023/pychopper_igv.png

It seems quite clear from the alignments that we have good evidence that a gene is present, and we even have a pretty good look at the intron structure of that gene, but this gene/transcript isn't present in the pinfish output.

This image is representative of many such apparent errors throughout the output. Presumably this lack of output can be explained by understanding the cluster-gff step but I can't seem to locate any detailed explanations of the algorithm being used. Do you have any advice on clustering parameters I could modify to get more complete output?

bsipos commented 5 years ago

Hello! The reads in your screenshot look quite "ladder-like" with essentially only a single read covering the gene. You would only get this in the output if you lowered the minimum cluster size to 1:

# -c parameter:
minimum_cluster_size: 1
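
To illustrate why "ladder-like" reads disappear under a minimum cluster size above 1, here is a minimal Python sketch. It is not pinfish's actual clustering algorithm: it assumes, purely for illustration, that reads are grouped only when their intron chains match exactly, and the read names, coordinates, and function names are all hypothetical.

```python
from collections import defaultdict


def cluster_by_intron_chain(reads):
    """Group read names by intron chain (exact match; a simplifying assumption)."""
    clusters = defaultdict(list)
    for name, introns in reads.items():
        clusters[tuple(introns)].append(name)
    return list(clusters.values())


def filter_clusters(clusters, minimum_cluster_size):
    """Keep only clusters with at least minimum_cluster_size reads."""
    return [c for c in clusters if len(c) >= minimum_cluster_size]


# Hypothetical ladder-like reads: each one is truncated at a different point,
# so no two reads share the same intron chain.
reads = {
    "read_full": [(100, 200), (300, 400), (500, 600)],
    "read_trunc1": [(300, 400), (500, 600)],
    "read_trunc2": [(500, 600)],
}

clusters = cluster_by_intron_chain(reads)
kept_strict = filter_clusters(clusters, 2)  # every cluster is a singleton
kept_loose = filter_clusters(clusters, 1)   # singletons are allowed through

print(len(kept_strict), len(kept_loose))
```

Under these assumptions every read ends up in its own cluster, so any threshold above 1 discards all of them, while a threshold of 1 keeps each read as its own candidate transcript.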

Regarding the inner workings of the pinfish tools, you can refer to the knowledge exchange videos: https://nanoporetech.com/resource-centre/knowledge-exchange-cdna-sequencing-nanopore-technology