cluster_gff option - Githubissues

ljw90607 commented 5 years ago

Dear @bsipos,

I'd like to clarify the options provided by cluster_gff. Do the two options below refer to a number of bases?

-d int : Exon boundary tolerance. (default 10)
-e int : Terminal exons boundary tolerance. (default 30)

If so, does the default for -d mean that 10 bp is tolerated on both sides of an exon?

Also, for the cluster size option, does this refer to the number of supporting reads in that cluster?

-c int : Minimum cluster size. (default 10)

Thank you very much for your help!

Jungwoo

bsipos commented 5 years ago

Hi! All the options work as you described.

Cheers, Botond
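To make the meaning of the tolerance options concrete, here is a minimal Python sketch of a tolerance-based comparison of two transcript structures. It is not the pinfish implementation: the function names are hypothetical, and it assumes that the terminal tolerance (`-e`) applies only to the outermost start and end of a transcript, while the exon tolerance (`-d`) applies to every internal boundary, in bases, on either side.

```python
def boundaries_match(a, b, tol):
    """Return True if two coordinates differ by at most `tol` bases."""
    return abs(a - b) <= tol


def same_structure(tx_a, tx_b, exon_tol=10, terminal_tol=30):
    """Compare two transcripts given as sorted lists of (start, end) exons.

    Assumption (not taken from pinfish): the outermost start and end use
    `terminal_tol`; all other exon boundaries use `exon_tol`.
    """
    if len(tx_a) != len(tx_b):
        return False
    last = len(tx_a) - 1
    for i, ((s1, e1), (s2, e2)) in enumerate(zip(tx_a, tx_b)):
        start_tol = terminal_tol if i == 0 else exon_tol
        end_tol = terminal_tol if i == last else exon_tol
        if not boundaries_match(s1, s2, start_tol):
            return False
        if not boundaries_match(e1, e2, end_tol):
            return False
    return True


# Hypothetical coordinates, for illustration only.
read_a = [(100, 200), (300, 400), (500, 600)]
read_b = [(120, 205), (295, 408), (495, 625)]
print(same_structure(read_a, read_b))
```

Under this reading, a cluster would only be reported if at least `-c` reads (default 10) share a matching structure.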

ljw90607 commented 5 years ago

Dear @bsipos,

Thank you for your generous help. I have another question, possibly related to the cluster_gff options. I generated the transcript consensus using cluster_gff with the default options. I expected transcripts with a similar structure to be collapsed in the resulting .gff file, but in the example below, the clustered reads in the .gff still appear unclustered even though their structures look quite similar.

I tried changing the -d (exon) and -e (terminal exon) options up to 100 bp, but the result was similar to the one shown above. How can I cluster all these transcripts and simply call them one cluster? Would you be able to give me some advice? Thank you again for your wonderful help!

Jungwoo

bsipos commented 5 years ago

Could you please post a screenshot of these gff entries from start to end, including the scale bar so I can judge the sequence length? It would also help if you could post here the relevant entries from the input GFF file and the command line you used to run the tool.

ljw90607 commented 5 years ago

Dear @bsipos, here's a screenshot of the GAPDH gene from the gff file.

If you need anything else, please do let me know. Thank you again!

Jungwoo

bsipos commented 5 years ago

That looks odd indeed. I will need the relevant entries from the GFF and the command line used for clustering.

ljw90607 commented 5 years ago

Dear @bsipos,

Here are the first 100,000 rows of the clustered gff file, since the original was too big to include. Another thing I should mention is that the input used to generate clustered.gff was a combination of gff files from multiple samples. I ran spliced_bam2gff on multiple samples and combined the .gff outputs as the input for cluster_gff. I don't know whether that caused the issue shown above.

Thank you again for your wonderful help!

Jungwoo

bsipos commented 5 years ago

The input of cluster_gff must be a sorted GFF file. Since you concatenated GFFs from different samples, this is definitely not the case, and it is the likely cause of this issue. Please merge the sorted BAM files from the different samples, make sure that the merged file is sorted, and then run spliced_bam2gff on it. Then you can use its output as the input to cluster_gff. Even better, I suggest you use the snakemake pipeline for running the tools. Please let me know if the issue persists when analysing the data this way.

Best regards, Botond
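A quick way to see whether a combined GFF violates the sorted-input requirement is to check it before clustering. The Python sketch below is a hypothetical helper, not part of pinfish. It assumes that "sorted" means each sequence name forms one contiguous block and that transcript-level records (taken here to be every record whose feature column is not `exon`) have non-decreasing start coordinates within a block.

```python
def gff_is_sorted(lines):
    """Check transcript-level GFF records for positional sorting.

    Assumptions (not confirmed by pinfish): tab-separated GFF columns,
    transcript-level records are those whose feature is not "exon".
    """
    seen = set()        # sequence names whose block has already started
    current = None      # sequence name of the block being read
    last_start = None   # start of the previous transcript in this block
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[2] == "exon":
            continue
        seqname, start = fields[0], int(fields[3])
        if seqname != current:
            if seqname in seen:
                # The sequence reappears later in the file, which is
                # typical of GFFs concatenated from several samples.
                return False
            seen.add(seqname)
            current, last_start = seqname, start
        elif start < last_start:
            return False
        else:
            last_start = start
    return True
```

It can be applied to an open file, for example `gff_is_sorted(open("combined.gff"))`, where `combined.gff` is a placeholder file name.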

bsipos commented 5 years ago

So, I just put together a small tool to sort concatenated GFFs by transcript position, which might also help you. You can download it from here. Let me know if it is useful.

Botond
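For readers who cannot use the binary, the idea of sorting concatenated GFFs by transcript position can be sketched in Python. This is not the tool linked above and may differ from it. The function name is hypothetical, and the sketch assumes that each transcript record is immediately followed by its `exon` records, and that sequences should keep the order in which they first appear in the input.

```python
def sort_gff_by_transcript(lines):
    """Sort transcript groups by (sequence, start, end), keeping exons attached.

    Assumptions (not confirmed by pinfish): tab-separated GFF columns, and
    every non-"exon" record starts a new transcript group.
    """
    header, groups = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        feature = line.split("\t")[2]
        if feature != "exon" or not groups:
            groups.append([line])      # transcript record opens a group
        else:
            groups[-1].append(line)    # exon joins the current transcript

    # Rank sequences by first appearance, so the reference order of the
    # first sample is preserved rather than an alphabetical order.
    order = {}
    for group in groups:
        order.setdefault(group[0].split("\t")[0], len(order))

    def position(group):
        fields = group[0].split("\t")
        return (order[fields[0]], int(fields[3]), int(fields[4]))

    result = list(header)
    for group in sorted(groups, key=position):
        result.extend(group)
    return result
```

The returned lines can be written to a new file, one per line, and passed on to cluster_gff. Merging and sorting the BAM files, as suggested above, remains the route recommended in this thread.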

ljw90607 commented 5 years ago

Dear @bsipos

Thank you very much for your wonderful help.

I will try the analysis with the merged and sorted BAM, and also with the GFF run through the sorting tool you provided. I will let you know how it goes! Thank you again!

Jungwoo

ljw90607 commented 5 years ago

Dear @bsipos,

It seems to be working properly with the merged and sorted BAM file. For future use, could you explain how to use the tool you provided for concatenated GFFs? I couldn't figure out how to use it since it is in binary format.

Thank you for your great help again!

Jungwoo

nanoporetech / pinfish

cluster_gff option #12