innate2adaptive / Decombinator

Decombinator v4: fast, error-correcting analysis of TCR repertoires
https://innate2adaptive.github.io/Decombinator/
MIT License

ZeroDivisionError: division by zero #32

Open yub-hutch opened 4 months ago

yub-hutch commented 4 months ago

Hi,

The error `ZeroDivisionError: division by zero` is raised in `Collapsinator.py`, line 766, in `collapsinator`: `counts['pc_uniq_dcr_kept'] = counts['number_output_unique_dcrs'] / counts['number_input_unique_dcrs']`.
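For what it's worth, a guard like the following sketch (using the key names from the traceback; an illustration of the failure mode, not a proposed patch) would avoid the crash when no unique DCRs reach the collapsing step:

```python
# Sketch of a defensive guard around the division at Collapsinator.py
# line 766. The key names come from the traceback; the zero values
# reproduce the empty-input case that triggers the error.
counts = {"number_output_unique_dcrs": 0, "number_input_unique_dcrs": 0}

if counts["number_input_unique_dcrs"] > 0:
    counts["pc_uniq_dcr_kept"] = (
        counts["number_output_unique_dcrs"] / counts["number_input_unique_dcrs"]
    )
else:
    # No unique DCRs made it through decombination, so there is
    # nothing to collapse; record 0 instead of dividing by zero.
    counts["pc_uniq_dcr_kept"] = 0.0
```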

Also, is the I8 spacer still supported by the current version? I can still pass `I8` to `-ol`, but I'm not sure whether it will work as expected.

Thanks!

MVCowley commented 4 months ago

Hi @yub-hutch,

Sounds like there aren't any decombined reads being output from decombinator, which could be for a number of reasons. If you could provide the log files from the decombinator run I can have a closer look.

What experimental protocol are you using to prepare your samples for sequencing? The I8 option (where the barcode region is structured I8-hexamer-I8-hexamer) should still work but we don't use it in-house anymore (we presently use either the m13 or i8_single options).

Thanks, Matt

yub-hutch commented 4 months ago

Thanks for your quick response, Matt!

We are using an existing dataset (Instrument: NextSeq 500; Strategy: AMPLICON; Source: TRANSCRIPTOMIC; Selection: RACE; Layout: SINGLE). The oligo is SP2-I8-6N-I8-6N, as described in this paper: https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2017.01267/full.

I have attached the three log files of the error: 2024_02_25_dcr_test_alpha_Collapsing_Summary.csv 2024_02_23_alpha_test_Decombinator_Summary.csv 2024_02_25_alpha_test_Decombinator_Summary.csv

Thanks!

MVCowley commented 4 months ago

No worries! Thanks very much for supplying the logs. The decombination step appears to be running fine. What version of the code are you using, and which arguments are you supplying to Collapsinator?

Additionally if you wouldn't mind sharing the .n12 file, that would also be helpful.

Thanks, Matt

yub-hutch commented 4 months ago

Thanks! I was using v4.3.0 and the command was `python ../Decombinator/dcr_pipeline.py -fq test.fastq -br R2 -bl 42 -c a -ol I8 -dz -dc`. I reproduced the error after updating Decombinator to the latest version:

[screenshots of the error attached]

The .n12 file is large and I uploaded it here https://www.dropbox.com/scl/fo/5mojgbu8mzuqhgxo5vbu5/h?rlkey=r82qqkwnm1hdk4ncfnjp1qpaq&dl=0.

Thanks for your help!

Bo

MVCowley commented 4 months ago

Hi Bo,

Thanks for sending these bits over. I think I see what the issue is now. As you can see from the screenshot, Collapsinator isn't finding any groups of reads. The reason for this is that it can't find the I8 barcode pattern in your reads.

Looking at your .n12 file, are you sure you are using the I8 barcode (I8-hexamer-I8-hexamer)? It looks like you are using the M13 barcode instead (M13-hexamer-I8-hexamer). You can see this if you ctrl-F your .n12 file for the M13 sequence (`GTCGTGACTGGGAAAACCCTGG`).
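A quick way to check this programmatically (a hypothetical sketch, not part of the pipeline; only the M13 sequence comes from the thread, and reads are assumed to be plain sequence strings parsed from your file):

```python
# Hypothetical helper: what fraction of reads contain a given spacer
# sequence? A high fraction for M13 suggests the m13 barcode layout.
M13 = "GTCGTGACTGGGAAAACCCTGG"

def fraction_with_spacer(reads, spacer=M13):
    """Return the fraction of reads containing the spacer sequence."""
    if not reads:
        return 0.0
    return sum(spacer in read for read in reads) / len(reads)

# Example: one of the two reads carries the M13 spacer.
reads = ["AAAA" + M13 + "TTTT", "CCCCGGGGCCCCGGGG"]
print(fraction_with_spacer(reads))  # 0.5
```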

If having this oligomer is expected, you can get the pipeline to work by changing your `-ol` flag from `I8` to `m13`. Collapsinator will then be able to find the barcode region.

Could you test this and let me know? If it works, we might add a warning for instances like this to provide a more useful error message.

Thanks, Matt

yub-hutch commented 4 months ago

Hi, Matt,

It works if I change the spacer argument from `I8` to `m13`. It seems the dataset I'm working on has different spacers for different samples; the samples may have been sequenced in multiple batches. After testing both spacers and comparing the results, I can determine the true spacer for most samples, and just discard the few samples that produce final outputs under both spacer specifications.

It would be very nice to emit a warning for instances like this, because for those who analyze existing datasets, the spacer information may not be easily available. Another suggestion, if worth implementing, would be to automatically determine the spacer when it is unknown to the user.

Many thanks for your useful software and kind help.

Bo

MVCowley commented 4 months ago

Thanks for testing this for your use-case Bo. Happy to help and glad you found a solution. Interesting that the dataset contains instances of both sequencing strategies. Is it a publicly available dataset?

> It would be very nice to give a warning to instances like this, because for those who analyze existing datasets, the spacer information may not be easily available. Another suggestion, if worth implementing, would be automatically determining the spacer when it is unavailable to the user.

Agreed that would be helpful. Will add it to the feature wishlist.
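As a rough idea of what auto-detection could look like (a sketch only; the M13 sequence is the one quoted earlier in the thread, while `I8_PLACEHOLDER` stands in for the actual I8 spacer sequence):

```python
# Hypothetical sketch of spacer auto-detection: count occurrences of
# each candidate spacer across the reads and pick the most frequent,
# failing loudly if no candidate is found at all.
def detect_spacer(reads, spacers):
    hits = {name: sum(seq in read for read in reads)
            for name, seq in spacers.items()}
    best = max(hits, key=hits.get)
    if hits[best] == 0:
        raise ValueError("no known spacer found in reads")
    return best

SPACERS = {
    "m13": "GTCGTGACTGGGAAAACCCTGG",  # from this thread
    "i8": "I8_PLACEHOLDER",           # placeholder, not the real sequence
}
```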

Thanks, Matt

yub-hutch commented 4 months ago

Thanks, Matt. The dataset is not public. It's from a large project that spans a long time, which is why I suspect the samples were sequenced in different batches.