metagenome-atlas / atlas

ATLAS - Three commands to start analyzing your metagenome data
https://metagenome-atlas.github.io/
BSD 3-Clause "New" or "Revised" License
377 stars 98 forks source link

finding unmapped reads #616

Closed skeffington closed 1 year ago

skeffington commented 1 year ago

I'm trying to investigate unassembled reads in my dataset. I have some highly dominant taxa according to 16S that are completely missing from the metagenomes, so I'm wondering if this is an assembly problem with the metagenome data or a primer bias issue with the 16S data. I'm just struggling to interpret some of the Atlas output files and wondered if you could provide some guidance:

I started by looking here: /logs/assembly/calculate_coverage/align_reads_from_M50a_to_filtered_contigs.log I see

image

I don't entirely understand this, as it says 0.89M or ~25% of reads are mapped, yet 'reads used' is 6.9M. Is it not 25% of reads used?

The align2.BBWrap command was:

align2.BBWrap build=1 overwrite=true fastareadlen=500 nodisk=t ref=M50a/M50a_contigs.fa
sta in1=M50a/sequence_quality_control/M50a_QC_R1.fastq.gz,M50a/sequence_quality_control/M50a_QC_se.fastq.gz in2=M50a/sequence_quality_control/M50a_QC_R2.fastq.gz,null trimreaddescriptions=t
 out=M50a/sequence_alignment/M50a.sam threads=8 pairlen=1000 pairedonly=t minid=0.9 mdtag=t xstag=fs nmtag=t sam=1.3 local=t ambiguous=best secondary=t saa=f maxsites=10 unpigz=t -Xmx51G

suggesting there should be a sam file output. However, in M50a/sequence_alignment/ I only have a bam output - is this the output of the above command?

I also thought that /genomes/alignments/unmapped might have the unmapped reads from this mapping. Is that the case? The numbers of reads in the fastq files in this directory don't match up with the numbers in the align_reads_from_M50a_to_filtered_contigs.log file however.

Any guidance would be much appreciated!

SilasK commented 1 year ago

But keep in mind that there are 2 (or actually 3 ) points of mapping.

  1. Mapping reads of a sample to its assembled contigs
    • this is the log you showed me above)
    • See reports/assembly_report.html and stats/combined_contig_stats.tsv
  2. Mapping reads to the dereplicated set of genomes
    • This is the files in /genomes/alignments/unmapped
    • The counts are in genomes/counts/counts_genomes.parquet
    • In the latest version of atlas you get also a reports/genome_mapping/results.html
      1. Mapping of reads to the non-redundant gene catalog
    • generate with atlas run genecatalog

However, if you say there is a 25% mapping rate in (1.), it means the assembly is not very good. Maybe this is only one sample, and you still get good results in others.

But I fear if the assembly is not good then the MAGs nor the genecatalog will be comprehensive.

SilasK commented 1 year ago

Can you check the mapping rate of genomes and or genecatalog ?

Can you check what's in a sample?
I suggest to use sendsketch.sh in=reads_R1.fastq.gz in2=reads_R2.fastq,gz (sandsketch is installe dwith atlas)

SilasK commented 1 year ago

If you can not make a good assembly an idea would be to use

  1. a reference of your microbiome if something exists.
  2. Maybe PLASS gene assembler would work
skeffington commented 1 year ago

Thanks for the helpful ideas Silas! I'll have a go in the next few days and post here how I get on.

github-actions[bot] commented 1 year ago

There was no activity since some time. I hope your issue is solved in the mean time. This issue will automatically close soon if no further activity occurs.

Thank you for your contributions.