immcantation / presto

pRESTO is part of the Immcantation analysis framework for Adaptive Immune Receptor Repertoire sequencing (AIRR-seq). pRESTO is a bioinformatics toolkit for processing high-throughput lymphocyte receptor sequencing data.
https://presto.readthedocs.io
GNU Affero General Public License v3.0
0 stars 0 forks source link

Error during BuildConsensus #60

Closed ssnn-airr closed 6 years ago

ssnn-airr commented 6 years ago

Original report by Koda Hirotomo (Bitbucket: [Hirotomo Koda](https://bitbucket.org/Hirotomo Koda), ).


We're trying pRESTO to process UMI-Barcoded Ion Torrent PGM BCR mRNA (400bp) (which is indel error-prone like Roche 454 pyrosequencing). Reads are single-ended and all reversely oriented (start with 12bp UMI with fixed nucleotides inserted, and then C-region primer sequences). We seem to have no problem in QC, UMI annotation and primer masking (both sides) steps.

During BuildConsensus step, frequently but not always (probably depends on datasets) this error happens in our run. Error cases always show 'float division by zero' messages. I have to manually kill the process after this error.

BuildConsensus.py -s plusVJ_primers-pass.fastq --bf BARCODE --pf PRIMER -n 5 --prcons 0.6 --maxerror 0.2 --maxgap 0.5 --outname plus --log BC5.log
           START> BuildConsensus
            FILE> plusVJ_primers-pass.fastq
   BARCODE_FIELD> BARCODE
       MIN_COUNT> 5
   MIN_FREQUENCY> 0.6
     MIN_QUALITY> 0
         MAX_GAP> 0.5
    PRIMER_FIELD> PRIMER
PRIMER_FREQUENCY> 0.6
       MAX_ERROR> 0.2
   MAX_DIVERSITY> None
       DEPENDENT> False
     COPY_FIELDS> None
    COPY_ACTIONS> None
           NPROC> 24

PROGRESS> 02:58:03 |#                   |   5% ( 2,782) 3.6 minError processing sequence set with ID: CGGGATAGCTGTGGGA
PID 176:  Error in sibling process detected. Cleaning up.
PID 151:  Error in sibling process detected. Cleaning up.
Process Process-6:
Traceback (most recent call last):
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/bin/BuildConsensus.py", line 129, in processBCQueue
    consensus = cons_func(seq_list, **cons_args)
  File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in qualityConsensus
    for c in qual_set}
  File "/usr/lib/python3.5/site-packages/presto/Sequence.py", line 520, in <dictcomp>
    for c in qual_set}
ZeroDivisionError: float division by zero
PID 157:  Error in sibling process detected. Cleaning up.
PID 175:  Error in sibling process detected. Cleaning up.
PID 163:  Error in sibling process detected. Cleaning up.
PID 173:  Error in sibling process detected. Cleaning up.
PID 154:  Error in sibling process detected. Cleaning up.

Enabling '--nproc 1' option made no improvement.

Here's our system environment. We're happy to provide more information if you need.

OS: CentOS Linux release 7.3.1611
Docker : version 1.12.6, build 3a094bd/1.12.6
Image: kleinstein/immcantation:1.7.0
ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


No problem at all.

Error fixed in 344222d.

ssnn-airr commented 6 years ago

Original comment by Koda Hirotomo (Bitbucket: [Hirotomo Koda](https://bitbucket.org/Hirotomo Koda), ).


Now we've tried the kleinstein/immcantation:devel image and everything works. I also appreciate your helpful advice on the AlignSets tool. This also works well on our data.

Thank you very much for all of this!

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Sure thing. I ran the new file you sent and it worked fine.

However, the consensus sequences are poor quality with a lot of Ns in them due to the abundance of indels. With --freg 0.6 BuildConsensus requires 60% of the reads to have the same non-N character or it'll assign an N. So you get a lot of situations like the following:

       ID> CATTATTACGTCGTGG
 SEQCOUNT> 9
   INSEQ1> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
   INSEQ2> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
   INSEQ3> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
   INSEQ4> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
   INSEQ5> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
   INSEQ6> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
   INSEQ7> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGGTGACAGAGGTCATCCGCAGGGAGA
   INSEQ8> TCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAAC
   INSEQ9> CTCGCACAGTAATAGACGGCCGTGTCCGCGACGTGACAGAGGTCATCCGCAGGGAGAA
CONSENSUS> TCGCACAGTAATAGACGGCCGTGTCCGCGACGNNNNNNNNGNNNNNCNNNNGGNNNAN
  QUALITY> ```````````!!!!!!!!{!!!!!{!!!!{{!!!{

Where a series of frameshifts causes a lot of positions to be Ns in the consensus.

I would run AlignSets-muscle on the files before running BuildConsensus.

The results looked a lot better when I did this (I'll email you example output separately).

ssnn-airr commented 6 years ago

Original comment by Koda Hirotomo (Bitbucket: [Hirotomo Koda](https://bitbucket.org/Hirotomo Koda), ).


Great. Just in case, I've sent you another example file which causes the same error. This time no N characters are included but there ARE some sequences with characters with quality 0, which passed QC step because average quality score is above the threshold. I hope you just try this file too. No problem for me on deadlines.

Thank you again!

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Thanks. I got the test data. The problem was caused by some UMI groups having positions where the quality score was 0 in all of the sequences (the tail end was all N characters with quality 0).

I added a fix to the version of presto at the tip of the default branch in the bitbucket repo.

As it looks like you are using the docker container, I'm rebuilding the kleinstein/immcantation:devel image right now, and you can get the fix by using that image. The image hasn't finished building yet. It should be done by 3pm your time.

I'll try to make an official release of presto with the fix sometime next week, but I have some testing to do on the MaskPrimers changes first.

ssnn-airr commented 6 years ago

Original comment by Koda Hirotomo (Bitbucket: [Hirotomo Koda](https://bitbucket.org/Hirotomo Koda), ).


Thank you for your quick reply! I've sent you an email with example files.

ssnn-airr commented 6 years ago

Original comment by Jason Vander Heiden (Bitbucket: javh, GitHub: javh).


Greetings @hkoda,

I'll take a look at this. Would you be able to email me an example file that fails? (jason.vanderheiden at yale.edu).

My guess is that there's a sequence that's entirely N characters, or something similar, in the data that we aren't accounting for correctly. I'm not sure without looking at it though.