merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
440 stars 145 forks source link

[BUG] read.reference_end is mysteriously `None`: leads to anvi-profile error #1821

Closed ekiefl closed 3 years ago

ekiefl commented 3 years ago

Description

This error was reported by @SCarrSuperstar in this issue. Moving it here because it is a separate issue that deserves its own space. The error is:

$ anvi-profile -i G3004-02.bam -c GeoBio_metaSPAdes.contigs.db --output-dir G3004profile --sample-name G3004profile
Contigs DB .........................: Initialized: GeoBio_metaSPAdes.contigs.db 
                                      (v. 20)

WARNING
=====================================
Amino acid linkmer frequencies will not be characterized for this profile.

anvio ..............................: 7
profiler_version ...................: 35
sample_id ..........................: G3004profile
description ........................: None
profile_db .........................: /home/ankita/ConePool_Anvio1/G3004profile/PROFILE.db
contigs_db .........................: True
contigs_db_hash ....................: hashb1e89e4d
cmd_line ...........................: /home/ankita/.conda/envs/anvio-7/bin/anvi-profile -i G3004-02.bam -c GeoBio_metaSPAdes.contigs.db --output-dir G3004profile --sample-name G3004profile
merged .............................: False
blank ..............................: False
split_length .......................: 20,000
min_contig_length ..................: 1,000
max_contig_length ..................: 9,223,372,036,854,775,807
min_mean_coverage ..................: 0
clustering_performed ...............: False
min_coverage_for_variability .......: 10
skip_SNV_profiling .................: False
skip_INDEL_profiling ...............: False
profile_SCVs .......................: False
min_percent_identity ...............: None
report_variability_full ............: False

WARNING
=====================================
Your minimum contig length is set to 1,000 base pairs. So anvi'o will not take
into consideration anything below that. If you need to kill this an restart your
analysis with another minimum contig length value, feel free to press CTRL+C.

input_bam ..........................: G3004-02.bam                              
output_dir .........................: /home/ankita/ConePool_Anvio1/G3004profile
num_reads_in_bam ...................: 10,549,203
num_contigs ........................: 20,533
num_contigs_after_M ................: 20,533
num_splits .........................: 21,590
total_length .......................: 150,900,687
[18 Oct 21 13:04:44 Profiling w/1 thread] contigs are being pr (...)  ETA: ∞:∞:∞
✖ anvi-profile encountered an error after 0:00:14.052700
Traceback (most recent call last):
  File "/home/ankita/.conda/envs/anvio-7/bin/anvi-profile", line 98, in <module>
    main(args)
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/terminal.py", line 843, in wrapper
    program_method(*args, **kwargs)
  File "/home/ankita/.conda/envs/anvio-7/bin/anvi-profile", line 35, in main
    profiler.BAMProfiler(args)._run()
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/profiler.py", line 283, in _run
    self.profile_single_thread()
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/profiler.py", line 743, in profile_single_thread
    contig = self.process_contig(bam_file, contig_name, contig_length)
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/profiler.py", line 697, in process_contig
    split.auxiliary.process(bam_file)
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/contigops.py", line 215, in process
    self.run_SNVs_and_indels(bam)
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/contigops.py", line 494, in run_SNVs_and_indels
    for read in read_iterator(self.split.parent, self.split.start, self.split.end, **kwargs):
  File "/home/ankita/.local/lib/python3.6/site-packages/anvio/bamops.py", line 90, in fetch_and_trim
    if read.reference_end - end > 0:
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

Problem

I have identified the problem. Basically, a small percentage of reads from pysam are appearing without the critical attribute reference_end. According to the pysam documentation, "[reference_end] returns None if not available (read is unmapped or no cigar alignment present)".

Well things certainly look like they are mapped and have a cigar alignment! Here is one such read that causes the error:

<anvio.bamops.Read object at 0x1552a98d0>
 ├── start, end : [65898, None)
 ├── cigartuple : [(0, 113), (4, 43)]
 ├── read       : GGTCATGTTGACAAATGAATGGGTGTTGCGGCGCTGTGCCGCACTCGTCTGTTGTCGTTATATTGCAAAACACAGATGTTCGGCGCACTTTTTGGGGTCATGCCGAGCACCGCTTGTTCTGTCATTCCGAGCGGTAGCGAGGAATCCAGTGCGATG
 └── reference  : GGTCATGTTGACAAATGAATGGGTGTTGCAGCGCTGTGCCGCACTCGTCTGTTGTCGTTATATTGCAAAACACAGATGTTCGGCGCACTTTTTGGGGTCATGCCGAGCACCGC-------------------------------------------

It has a CIGAR tuple and clearly maps to a specific point in the reference sequence. Yet, when I check the attribute is_unmapped, indeed, it returns True.

I have no good explanation for this. There seems to be little rhyme or reason to which reads have the attribute is_unmapped set to True. Here is a spattering of problem reads--maybe it will make sense to someone else:

<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3346, None)
 ├── cigartuple : [(4, 1), (0, 151), (4, 4)]
 ├── read       : ATGCTTAGTCCTCCTACTTCTAGTCTAACTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCACTATACTACAGATGGGAGCACACCGAATTGTAGTAGTCCTTCG
 └── reference  : -TGCTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCACTATACTACAGATGGGAGCACACCGAATTGTAGTAGTCC----
13793
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3346, None)
 ├── cigartuple : [(4, 43), (0, 68), (4, 45)]
 ├── read       : GCTTCTTTTTTACAGTTAGGCTCTTATAGAGTGACCTATGGGATGCTTAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCAC
 └── reference  : -------------------------------------------TGCTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACT---------------------------------------------
13794
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3346, None)
 ├── cigartuple : [(4, 11), (0, 145)]
 ├── read       : GACCAATGGGATGCTTAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCACTATACTACAGATGGGAGCACACCGAATTGTAG
 └── reference  : -----------TGCTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCACTATACTACAGATGGGAGCACACCGAATTGTAG
13795
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3346, None)
 ├── cigartuple : [(4, 43), (0, 68), (4, 45)]
 ├── read       : GCTTCTTTTTTACAGTTAGGCTCTTATAGAGTGACCTATGGGATGCTTAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGTTCAC
 └── reference  : -------------------------------------------TGCTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACT---------------------------------------------
13796
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3346, None)
 ├── cigartuple : [(4, 114), (0, 42)]
 ├── read       : GTCCTGTTTTTATTCCACAACCTAGCCTAACTCCTCTGATACTAAAAGCTATTGCTTGTGTAAATGGAGAGGCTTCTTTTTTACAGTTAGGCTCTTATAGAGTGACCTATGGGATGCTTAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCT
 └── reference  : ------------------------------------------------------------------------------------------------------------------TGCTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCT
13797
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3348, None)
 ├── cigartuple : [(4, 99), (0, 57)]
 ├── read       : AGGGAACCACTGAAGGGTGTGGAGATCCGTGCGATTGTCTGTAAGGACACACTCTCTTCGGCTATACAGACCGGTCTATATAGAATCACAAATGGACAACTGAGTCCTCCTACTTCTAGTCTAACTACCAATCCCCCCTCACCTCCTTACACTTCA
 └── reference  : ---------------------------------------------------------------------------------------------------CTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCA
13798
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3348, None)
 ├── cigartuple : [(4, 71), (0, 85)]
 ├── read       : GTGCGATTGTCTGTAAGGACACACTCTCTTCGGCTATACAGACCGGTCTATATAGAATCACAAATGGACAACTGAGTCCTCCTACTTCTAGTCTAGCTACCAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCAC
 └── reference  : -----------------------------------------------------------------------CTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCAC
13799
<anvio.bamops.Read object at 0x154c5e438>
 ├── start, end : [3348, None)
 ├── cigartuple : [(4, 49), (0, 107)]
 ├── read       : ACTCTCTTCGGCTATACAGACCGGTCTATATAGAATCACAAATGGACAACTGAGTCCTCCTACTTCTAGTCTAACTACCAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGT
 └── reference  : -------------------------------------------------CTGAGTCCTCCTACTTCTAGTCTAGCTACAAATCCCCCCTCACCTCCTTACACTTCAGCCCAGACTACGACTCTTTCACCTCCACCGACTCCCACCGGTGTTTCGGT
<anvio.bamops.Read object at 0x154d71d68>
 ├── start, end : [3206, None)
 ├── cigartuple : [(0, 156)]
 ├── read       : GACGCAGTCTCGCGCATCAAGGTAAGCGCGGCACGGACAGCGACATTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGACAGTGCTGGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAACTCCGCCCGTTGCCAAGTTCCCG
 └── reference  : GACGCAGTCTCGCGCATCAAGGTAAGCGCGGCACGGACAGCGACATTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGACAGTGCTGGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAAGTCCGCCCGTTGCCAAGTTGGCT
13783
<anvio.bamops.Read object at 0x154d71048>
 ├── start, end : [3236, None)
 ├── cigartuple : [(0, 123), (4, 33)]
 ├── read       : GCACGGACAGCGACATTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGATGGTACTTGCACCGCTGGCATCGCGCGGCGTTTCAACCGCACCAACTCCGCCTGTCGCCAAGTTGCCGAAGCCGTTAAATTGAAATAGTAGTGCGGCA
 └── reference  : GCACGGACAGCGACATTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGACAGTGCTGGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAAGTCCGCCCGTTGCCAAGTTG---------------------------------
13784
<anvio.bamops.Read object at 0x154d71d68>
 ├── start, end : [3251, None)
 ├── cigartuple : [(0, 108), (4, 48)]
 ├── read       : TTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGACAGTGCTGGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAACTCCGCCCGTTGCCAAGTTGCCGAAGCCGTTAAATTGAAATACAGTAATTTCTTGCGCCTGAACGCAA
 └── reference  : TTTTCCGCTGTGTAATACTTCCCGCCAATGCCGTTAATCGTGACAGTGCTGGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAAGTCCGCCCGTTGCCAAGTTG------------------------------------------------
13785
<anvio.bamops.Read object at 0x154d71048>
 ├── start, end : [3302, None)
 ├── cigartuple : [(4, 8), (0, 57), (4, 91)]
 ├── read       : TGCTACTTGCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAACTCCGCCCGTTGCCAAGTTGCCGAAGCCGTTAAATTGAAATACAGTAATTTCTTGCGCCTGAACGCAAGACATAAAGAGAAAAAATATGCACGCTGTAAGGATTGGAAGTT
 └── reference  : --------GCACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAAGTCCGCCCGTTGCCAAGTTG-------------------------------------------------------------------------------------------
13786
<anvio.bamops.Read object at 0x154d71048>
 ├── start, end : [3304, None)
 ├── cigartuple : [(0, 55), (4, 101)]
 ├── read       : ACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAACTCCGCCCGTTGCCAAGTTGCCGAAGCCGTTAAATTGAAATACAGTAATTTCTTGCGCCTGAACGCAAGACGCAATGAGAAAAAATATGCACGCTGTAAGGAGTGAAAGTTTCATTGGAAG
 └── reference  : ACCGCTGGCATCGCGCGGCGTTTCTACTGCACCAAGTCCGCCCGTTGCCAAGTTG-----------------------------------------------------------------------------------------------------

In total, 18,472 reads had is_unmapped set to True while simultaneously having CIGAR tuples and therefore seemingly aligning to the references. This represents a very small percentage of the 10,549,203 total reads.

Solution

The solution I see is to ignore any reads with is_unmapped set to True.

meren commented 3 years ago

That is for documenting this. I see no problem with this fix, and I don't think there is any risk to break things :)

But I'm also surprised that this didn't occur until now. I wonder if it is due to a bug in the mapping software.

@SCarrSuperstar, can you share with us what did you use for read recruitment and whether there were some unusual parameters?

Best,

SCarrSuperstar commented 3 years ago

Thanks @ekiefl and @meren.

I think I know the problem. This may be a result of a collaborated effort. My collaborator actually made the co-assembly, and he didn't know that I already ran quality control on the files, so they did some extra QC screening. I'm now realizing that it is possible that my PE read files have sequences that won't map to the co-assembly which I used to make my contig.db file.

Thanks again so much for your time. I hope this maybe helps someone else, and gives me a reason to practice some Pysam. I'm going to try to remove these unmapped reads from the bam files I have so I don't have to start over.

ekiefl commented 3 years ago

Hi @SCarrSuperstar,

No problem, we are happy to help.

I'm going to try to remove these unmapped reads from the bam files I have so I don't have to start over.

If you install the version of anvi'o we plan to release in the next day or two, you won't have to remove these reads from the bam files or start over. anvi-profile will run without error on your current contigs db and bam file.

SCarrSuperstar commented 3 years ago

That sounds even better!