Process for determining % of data binned?

maggieweng commented 12 months ago

Hi there! I am currently working with ATLAS on a large dataset with many samples from different locations. I am trying to determine what percentage of the reads in each sample were 1) incorporated into MAGs derived from that sample or 2) mapped to MAGs from any sample in the dataset. I am not sure where to find the data for 1). For 2), I am currently using the raw_counts_genomes.tsv file in the "genomes/counts" folder: I sum the mapped reads across all MAGs in each sample and divide by the total number of QC-passing reads for that sample to get the percentage of reads that mapped. However, sometimes I get a percentage greater than 100. I assume this is because the mapping process allows one read to multi-map to several bins. Can you confirm that this is the case, and that my calculation is otherwise correct?

If there is a file that contains the proportion of reads from each sample used in binning that would also be very helpful, as I can't locate that in the file structure.

Thanks!

SilasK commented 12 months ago

Hey @maggieweng. It is always a good idea to look at the mapping rate.

To get 2), I suggest looking at the reports/genome_mapping html + csv.

For 1), I suggest using the coverage information for the assembly in <sample>/assembly/contig_stats/postfilter_coverage_stats.txt and combining it with <sample>/binning/<binner>/cluster_attribution.tsv. You probably want to remove low-quality bins from your calculation.
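A minimal sketch of how these two files could be combined, in Python with only the standard library. It assumes (this is not confirmed in the thread) that postfilter_coverage_stats.txt is a tab-separated table with a header containing `#ID`, `Plus_reads` and `Minus_reads` columns, and that cluster_attribution.tsv has two tab-separated columns, contig and bin, without a header. The function names and the `keep_bins` filter are hypothetical helpers; check the column names against your own files before relying on this.

```python
import csv
import io


def reads_per_contig(covstats_text):
    """Return {contig: mapped reads}, assuming '#ID', 'Plus_reads' and
    'Minus_reads' columns in a tab-separated table with a header."""
    reader = csv.DictReader(io.StringIO(covstats_text), delimiter="\t")
    counts = {}
    for row in reader:
        contig = row["#ID"].split()[0]  # drop any description after the ID
        counts[contig] = int(row["Plus_reads"]) + int(row["Minus_reads"])
    return counts


def contig_to_bin(attribution_text):
    """Return {contig: bin}, assuming two tab-separated columns, no header."""
    mapping = {}
    for line in attribution_text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        mapping[fields[0]] = fields[1]
    return mapping


def percent_reads_binned(contig_reads, contig_bins, total_reads, keep_bins=None):
    """Percent of total_reads that mapped to contigs assigned to a bin.

    keep_bins: optional set of bin names to count, for example only the
    bins that passed your quality filter; None counts every bin.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    binned = 0
    for contig, n_reads in contig_reads.items():
        bin_id = contig_bins.get(contig)
        if bin_id is None:
            continue  # contig was not assigned to any bin
        if keep_bins is not None and bin_id not in keep_bins:
            continue  # bin was filtered out
        binned += n_reads
    return 100.0 * binned / total_reads
```

To use it, read each file into a string (for example with `pathlib.Path(...).read_text()`), pass the results to `percent_reads_binned`, and supply a `total_reads` value that counts both mates of every pair.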

Note: Atlas does give you the total number of reads in report/read_stats.tsv. However, there is an error in how that total is calculated. I should fix this, but in the meantime you should take the number of paired-end reads and multiply it by 2.
With that correction, you would get ~50% for most of the samples if you take the counts as mapped reads.
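The correction described above can be sketched as a small Python calculation. The function name and the example numbers are hypothetical and only illustrate the arithmetic: the denominator is the paired-end read count multiplied by 2, so that both mates are counted.

```python
def percent_mapped(counts_per_mag, read_pairs):
    """Percent of reads in one sample that mapped to any MAG.

    counts_per_mag: {MAG name: mapped read count} for a single sample.
    read_pairs: number of paired-end reads (pairs) reported for that sample.
    """
    total_reads = 2 * read_pairs  # count both mates of every pair
    if total_reads <= 0:
        raise ValueError("read_pairs must be positive")
    mapped = sum(counts_per_mag.values())
    return 100.0 * mapped / total_reads


# Hypothetical sample: 1,000,000 pairs and 1,000,000 mapped read counts.
# Dividing by the number of pairs alone would give 100%; dividing by the
# number of individual reads gives 50%.
example = percent_mapped({"MAG1": 600_000, "MAG2": 400_000}, 1_000_000)
```

Using the number of pairs as the denominator doubles the apparent mapping rate, which is one way to end up with values above 100%.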

github-actions[bot] commented 10 months ago

There has been no activity for some time. I hope your issue has been solved in the meantime. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.

metagenome-atlas / atlas

Process for determining % of data binned? #690