Limit to the number of reference sequences?

novoalab / EpiNano

Detection of RNA modifications from Oxford Nanopore direct RNA sequencing reads (Liu*, Begik* et al., Nature Comm 2019)

GNU General Public License v2.0

110 stars 31 forks source link

Limit to the number of reference sequences? #75

Closed dennisrm closed 3 years ago

dennisrm commented 3 years ago

I ran EpiNano 1.2.0 and was wondering about the "Ref" column of the output of Epinano_Variant.py (sample.per_site.5mer.csv).

I assumed it was just the number of the reference sequence (the first sequence in ref.fa would be 1, the second 2, etc...), but there seems to be a max of 60 regardless of how many sequences are in my reference file. I tried with alignments to two different references, a genome with hundreds of sequences, and a transcriptome with thousands, but both only go up to 60 for the "Ref" column.

In the documentation there's this note:

Note 2: the users should split the computations for each reference sequences if the reference genome is large.

I assumed that was just to improve computational time, but is there a limit to the number of reference sequences that EpiNano will process at once?

Thanks

Huanle commented 3 years ago

There is no limit to the number of references. The splitting bam/sam suggestion is to reduce computaitonal load and speed up analysis.

As for your issue, what i guess is that you may not have reads mapped to the references missing from the output. You can quickly check this out and let me know if this is the case. samtools view -F 3844 your.bam | cut -f3| sort|uniq

dennisrm commented 3 years ago

Thank you for your response.

When I use that command on the bam files, I see reads mapping to many of the reference sequences. When I add | wc I see that there are 177 reference sequences in the bam that was aligned to the genome, and 2679 for the bam aligned the the transcriptome. But both seem to only have 60 Ref sequence IDs in the output of Epinano_Variant.

I'll double check to make sure I didn't make any mistakes running anything, but I appreciate you letting me know that this isn't the expected output.

EmanueleRaineri commented 3 years ago

@dennisrm you might want to check whether you are using a suitable version of sam2tsv. I used to get those numbers in the Ref field as well, due to the fact that I installed a sam2tsv which is a different version from the misc/sam2tsv distributed with Epinano. This caused problems with parsing, which in turn caused the Ref field to contain the mapping quality instead of the chromosome (or transcript) name. Not sure this will work in your case, but maybe worth a try.

dennisrm commented 3 years ago

Thank you @EmanueleRaineri you were right that it was an issue with sam2tsv.

Using the version in Epinano/misc I was able to get the expected output, with reference sequence names in the Ref column, instead of numbers 1 to 60.

Huanle commented 3 years ago

thanks @EmanueleRaineri!