mhammell-laboratory / TEtranscripts

A package for including transposable elements in differential enrichment analysis of sequencing datasets.
http://hammelllab.labsites.cshl.edu/software/#TEtranscripts
GNU General Public License v3.0

saturation analysis #151

Closed AlisaGU closed 9 months ago

AlisaGU commented 9 months ago

Hi, I am confused about the y-axis of the saturation analysis figure in the TEtranscripts paper. Could you describe it in detail?

[screenshot: saturation analysis figure from the TEtranscripts paper]

olivertam commented 9 months ago

Hi,

Thank you for your question.

We took the datasets described in the legend and aligned them against the relevant genome, with the "ground truth" being the case where all alignments were reported for each read, with no limit. Applying a maximum-alignment cutoff (i.e. reads aligning more than the cutoff number of times are considered unmappable), the graphs show how the proportion of possible multi-reads (those that can be aligned ambiguously) that is recovered increases from 70% (when reporting reads mapping <=10 times) to 90% (when reporting reads mapping <=100 times). Increasing the limit to 200 doesn't recover many additional multi-mappers, suggesting that most of the remainder are sequences that map to regions of low complexity existing in >200 locations in the genome (e.g. poly-nucleotide tracts).

Thus, the paper suggests that a cutoff of 100 provides the right balance between sensitivity and efficiency.

Thanks.

AlisaGU commented 9 months ago

Thanks for your reply! I am still not clear on how the proportion of possible multi-reads is computed. Could you give an example?

AlisaGU commented 9 months ago

Is it the proportion of multi-reads among all mappable reads?

olivertam commented 9 months ago

Hi,

Example:
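Suppose, purely for illustration, a library of 100,000 reads in which 10,000 reads can align to more than one location in the genome. Of those 10,000 possible multi-reads, 9,000 align to 100 or fewer locations, and 1,000 align to more than 100 locations.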

Thus, the possible multi-reads (the denominator) are the total number of reads that could be aligned more than once in the genome, and the numerator is the number of multi-reads that would have been reported given the maximum alignment cutoff (since reads that exceed the cutoff are not reported as aligned).

From the previous example:
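With a maximum alignment cutoff of 100, the 1,000 reads aligning more than 100 times are treated as unmappable, so the reported proportion of possible multi-reads is 9,000/10,000 = 90%. Raising the cutoff would recover some of the remaining 1,000 reads, at increasing computational cost.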

Please let me know if this does not address your question.

Thanks.

AlisaGU commented 9 months ago

Wow, so detailed!

If I set the maximum cutoff to 100, how can I get the number of reads that align more than 100 times? Is there a way to know this number (1000 in this example)?

olivertam commented 9 months ago

Hi,

It depends on the aligner you use. For STAR, you can set --outFilterMultimapNmax to a very high number (e.g. 3000000). Be aware that this could create a HUGE output file and take quite a while. Once you have that output, you can look at the NH:i: tag to get the number of alignments for each read, and collate those to get your desired output. This can then be used as your "ground truth" by looking at the number of reads with NH > 1.
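As a minimal sketch of that collation, using e.g. the pysam library (the BAM filename is hypothetical; this assumes the aligner writes the NH tag on every alignment, which STAR does):

```python
from collections import Counter

import pysam

# Tally the NH value (number of reported alignments) per read,
# counting each read once via its primary alignment.
nh_counts = Counter()
with pysam.AlignmentFile("star_all_alignments.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        nh_counts[read.get_tag("NH")] += 1

total_mapped = sum(nh_counts.values())
multi_mapped = sum(n for nh, n in nh_counts.items() if nh > 1)
print(f"mapped reads: {total_mapped}, multi-mappers (NH > 1): {multi_mapped}")
```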

Thanks.

AlisaGU commented 9 months ago

Thanks, I get it! Actually, I have a huge genome (47G) with a high proportion of TEs. I plan to randomly extract some reads to test the maximum cutoff with hisat2 (chosen just for its speed). Do you have a suggestion for what proportion of all reads to use as the test set?

olivertam commented 9 months ago

Hi,

Are you simulating reads, or taking a "real" library to test? I might argue for the latter, since that would better reflect your experimental output.

With hisat2, you can provide the --all parameter to get all the possible alignments. It's just really slow (as the hisat2 manual indicates).

However, another approach is to test the various cutoffs of interest, and then plot the number of reads mapped. You could use the total number of reads as your denominator, and plot either all mapped reads (as a proportion of all reads) or just those with >1 alignment. You should hopefully see diminishing returns (e.g. <1% increase in mapped reads per step), and can determine the best alignment cutoff that way. In essence, you're reproducing the graph, but rather than evaluating the proportion against a ground truth, you're assessing where you "reach saturation" based on the slope (the increase in mapped reads with each increase in the cutoff), as in the sketch below.
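For instance, a sketch of that slope check (the cutoffs and mapped-read counts below are placeholders; substitute the numbers from your own alignment runs):

```python
# Hypothetical results: total reads, and reads mapped at each tested cutoff.
total_reads = 100_000_000
mapped = {10: 70_000_000, 50: 85_000_000, 100: 90_000_000, 200: 90_500_000}

# Report the gain in mapped reads between consecutive cutoffs; once the
# gain drops below ~1% of all reads, you are close to saturation.
cutoffs = sorted(mapped)
for prev, curr in zip(cutoffs, cutoffs[1:]):
    gain = (mapped[curr] - mapped[prev]) / total_reads
    print(f"cutoff {prev} -> {curr}: +{gain:.2%} of all reads")
```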

Please let me know if that's unclear. Thanks.

AlisaGU commented 9 months ago

A real RNA-seq dataset.

Everything is clear now! You are so patient! Thanks!

AlisaGU commented 9 months ago

Wait a moment ha-ha-ha!

My aim is to extract some reads from a real dataset to form a test dataset. However, I am not sure whether 10% or 5% is a good choice.

olivertam commented 9 months ago

Ah ok.

How large is your RNA-seq library?

I have limited experience with HiSat2 (in terms of speed with increasing -k), so I haven't really subset the data for testing (I have run 30Gb files through STAR in parallel to get the results we need).

The only downside of sampling is that most libraries we've worked with (human/mouse/fly) have only a minority of multimapping reads (~20%), so downsampling too much might remove some of the "high multimappers" (though I don't know your genome and transcriptome profile). I would want to use at least 20 to 30% of the library, just to sample deeply enough, but that is not based on any empirical measurement of your data.

Another approach is to run HiSat2 looking for unique mappers only, with an increasing number/proportion of reads from your RNA-seq library. You can then determine whether the % of uniquely mapped reads stays relatively constant between runs, and take the proportion that best reflects the full library.

Of course, if you're taking a smaller portion of the library, I would recommend resampling a few times to capture the variability of your test library.
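A minimal sketch of that resampling, assuming uncompressed FASTQ input with mates in the same order in both files (the filenames and the 25% fraction are hypothetical):

```python
import random

def subsample_fastq(in_path, out_path, fraction, seed):
    """Keep each 4-line FASTQ record with probability `fraction`."""
    rng = random.Random(seed)
    with open(in_path) as src, open(out_path, "w") as dst:
        while True:
            record = [src.readline() for _ in range(4)]
            if not record[0]:  # end of file
                break
            if rng.random() < fraction:
                dst.writelines(record)

# Resample a few times with different seeds to gauge variability.
# Reusing the seed for R1 and R2 keeps the mates paired, as long as
# the reads appear in the same order in both files.
for seed in (1, 2, 3):
    for mate in ("R1", "R2"):
        subsample_fastq(f"sample_{mate}.fastq", f"test{seed}_{mate}.fastq",
                        0.25, seed)
```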

Thanks.

AlisaGU commented 9 months ago

> How large is your RNA-seq library?

About 20G.

> I have run 30Gb files through STAR in parallel

Could you share the run time and the number of parallel jobs?

My partners believe hisat2 is faster, but I have no idea about that. What do you think?

AlisaGU commented 9 months ago

20G is the size of one of the paired-end files.

olivertam commented 9 months ago

Hi,

I typically use 10 cores for each STAR run. I haven't run an "all alignments" job recently, but with a cutoff of 100, it takes an hour to process 11Gb of paired-end data. Let me quickly test an 11Gb PE library with "all alignments", and get back to you.

Thanks.

olivertam commented 9 months ago

Just a quick update/estimate: running STAR with 10 cores on a PE library of 151 million reads, getting all alignments takes ~3hrs.

Not sure how that compares with hisat2. However, if you want to downsample multiple times and run the analyses in parallel, it should give you a decent idea of what the best alignment cutoff would be.

Thanks.

AlisaGU commented 9 months ago

Coooooool!!!!

Thank you very much for your patience!

AlisaGU commented 9 months ago

Hi, I am reopening this issue. I tested two samples using STAR with a maximum cutoff of 1000, and used this read set as the ground truth. Unexpectedly, the curves seem to saturate at a cutoff of 500 or 600. This value is far beyond the recommended one, and I am afraid there may be some factors I have ignored. The y-axis is the proportion of multi-reads.

[screenshots: saturation curves for the two samples]

olivertam commented 9 months ago

Hi,

This is very interesting. There are a few possible explanations:

1. This is a highly repetitive genome. You mentioned that it's a large genome (47Gb), so perhaps there's a lot of sequence duplication that contributes to the high mapping rate.
2. The repetitive portion of the genome is transcribing a lot of reads. I would be curious to know the proportion of multi-mapping reads compared to uniquely mapping reads.
3. One of your libraries might have DNA contamination, so you're sequencing genomic DNA rather than transcripts, which would be highly repetitive (if scenario 1 is true). This is less likely, but possible. You can check it by looking at the strandedness of your RNA-seq library (assuming you used a stranded RNA library prep kit).

One thing that I would check is the reads that are mapping ~500 times: see whether they align to a region of interest (e.g. TE), or to things that you're not interested in (e.g. rRNA or other structural RNA). This might dictate whether the high alignment cutoff is required to capture features of interest.
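If it helps, a minimal sketch of pulling out those reads with pysam (the NH range and filenames are hypothetical), so the output BAM can then be intersected with a TE or gene annotation, e.g. with bedtools:

```python
import pysam

# Keep every alignment belonging to a read reported ~500 times
# (NH between 450 and 550), for intersection with annotations.
with pysam.AlignmentFile("sample.bam", "rb") as src, \
        pysam.AlignmentFile("nh450_550.bam", "wb", template=src) as dst:
    for read in src:
        if not read.is_unmapped and 450 <= read.get_tag("NH") <= 550:
            dst.write(read)
```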

Thanks.

AlisaGU commented 9 months ago

> a large genome (47Gb)

Yes, it is 47G.

> the proportion of multi-mapping reads compared to uniquely mapping reads

When the cutoff was set to 1000, uniquely mapping reads accounted for 0.1872083 and 0.02694607 of the two samples, respectively.

Let me check these two remaining factors.

In addition, I found a peak in the per-read alignment counts at ~200 for one sample (the one with 0.1872083 unique mapping) and at ~300-500 for the other (0.02694607).

[screenshots: histograms of alignment counts per read for the two samples]

olivertam commented 9 months ago

Hi,

Wow, that's a very low uniquely aligned rate. This suggests that your genome is heavily duplicated, such that even genes are present in multiple copies (perhaps not too surprising if it's a plant genome). Yes, I think you're right that your optimal cutoff might be 500 (especially given your second graph). I would also check the reads mapping around 700 times, and see whether they are repetitive sequences that you are interested in, or other repetitive elements.

I also wonder what the difference between the two samples is, and whether you expect them to differ (e.g. due to different treatments/mutants). This might dictate your cutoff if you want to capture changes that are reflected in your second sample (vs. just the first).

Thanks.

AlisaGU commented 9 months ago

Thanks, let me check it!

AlisaGU commented 9 months ago

I am back! I extracted the reads aligning to 100-300 loci, 250-450 loci, and 650-750 loci respectively, and intersected these reads with gene and TE annotations. Most of the reads overlapped a TE (>90%). About 500 TEs have read counts greater than 100. Is that a big enough improvement in TE expression quantification to justify the cost of increasing the maximum cutoff?

[screenshots: overlap of the extracted reads with gene and TE annotations]

olivertam commented 9 months ago

Hi,

This really depends on whether those TEs are of particular interest in your organism (which is probably outside of my expertise). Without knowing the types of TE that are highly repetitive (and assuming that these are not just simple repeats or low-complexity sequences), I would probably err on the side of capturing them; if these TE copies share so much sequence similarity, it might suggest that they are close to the consensus and thus potentially active.

Again, I would have to defer to an expert who knows more about your genome (and TE within) to see if those highly repetitive TE are worth quantifying.

Sorry if that's not as helpful as you would like.

Thanks.

AlisaGU commented 9 months ago

You have provided so many useful suggestions! I will consult an expert for further exploration.

Thanks again!