ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
38 stars 12 forks source link

IndexError when generating artificial tumor reads #49

Closed aldosr closed 1 month ago

aldosr commented 2 years ago

Hi, I'm trying to generate tumor reads using the following parameters:

gen_reads.py -r hg19.fa \
        -R 150 \
        --pe-model ${FRAGLENMODEL} \
        -c 2500 \
        -e ${SEQERRORMODEL} \
        --gc-model ${GCBIASMODEL} \
        -tr ${BED} \
        -m ${MUTMODEL} \
        -o output

but I'm having this error:

 File "/path/to/NEAT/gen_reads.py", line 896, in <module>
    main()
  File "/path/to/NEAT/gen_reads.py", line 618, in main
    all_inserted_variants = sequences.random_mutations()
  File "/path/to/NEAT/source/SequenceContainer.py", line 546, in random_mutations
    if self.black_list[p][k]:
IndexError: index 18001 is out of bounds for axis 0 with size 18001

The mutational model was generated using an assembled vcf from a batch of tumor samples. The fragment length model and the gc bias model were generated from a tumor BAM, extracted from the tumor batch used for the mutational model. The sequencing model was generated from the same sample fastqs. BED file comprises a series of genes of interest.

The error didn't appear when the script was tested with a BED file of a single gene (TP53) and using hg38 as a reference. One solution can be to add a:

try: code except IndexError: continue But I'd like your input about the error. Thanks a lot for the tool, anyway!

joshfactorial commented 2 years ago

I'm not able to reproduce this error. What version of python are you running? Adding a try/except around that line will break the code downstream, is it depends on the blacklist to avoid collisions.

joshfactorial commented 2 years ago

Other details that might be useful: Can you supply the full output of the command up to the traceback? It would be helpful to see how far it got before hitting this error. The fact that the error happened in one bed file and not the other could be significant. It might be helpful to see those bed files, or at least a few sample lines. It might also be worth seeing if there are any key differences in the fasta files that could be causing the code to trip up, such as differences in the way the chromosomes are named.

aldosr commented 2 years ago

Hi Josh, Thanks for the detailed answers. I'm actually re-generating the tumor datasets using a cleaned BED file (I noticed an error in the BED file that could be the cause of the IndexError). As soon as the datasets are ready I will let you know if the problem persists. Thanks again.

aldosr commented 2 years ago

Hi again Josh, I generated the datasets again using the cleaned BED file, but the error persists. This is the complete output:

Warning: Read length of error model (149) does not match -R value (150), rescaling model...
Using empirical fragment length distribution.
found index /path/to/hg19.fa.fai
Warning: Reference contains sequences not found in targeted regions BED file.
reading chrM...
0.011 (sec)
--------------------------------
sampling reads...
[]
Read sampling completed in 0 (sec)
reading chr1...
100.239 (sec)
--------------------------------
sampling reads...
[----------]
Read sampling completed in 7379 (sec)
reading chr2...
99.104 (sec)
--------------------------------
sampling reads...
[-------Traceback (most recent call last):
  File "/path/to/NEAT/gen_reads.py", line 896, in <module>
    main()
  File "/path/to/NEAT/gen_reads.py", line 618, in main
    all_inserted_variants = sequences.random_mutations()
  File "/path/to/NEAT/source/SequenceContainer.py", line 545, in random_mutations
    if self.black_list[p][k]:
IndexError: index 18001 is out of bounds for axis 0 with size 18001

Interestingly, another dataset, ran in parallel using a bash script and SLURM, computated until chr5:

reading chr5...
60.400 (sec)
--------------------------------
sampling reads...
[-Traceback (most recent call last):
  File "/path/to/NEAT/gen_reads.py", line 896, in <module>
    main()
  File "/path/to/NEAT/gen_reads.py", line 618, in main
    all_inserted_variants = sequences.random_mutations()
  File "/path/to/NEAT/source/SequenceContainer.py", line 545, in random_mutations
    if self.black_list[p][k]:
IndexError: index 18006 is out of bounds for axis 0 with size 18006

These are the first lines of the input BED:

chr1    11601   11725
chr1    11736   11860
chr1    20506   20638
chr1    29803   29957
chr1    36829   36964
chr1    46836   47053
chr1    95424   95603
chr1    96959   97080

Please let me know if you need the entire BED or other info.

I'm putting @lbeltrame in cc since it's working with me on this.

joshfactorial commented 1 month ago

Check out the new version 4.2 and see if it works for you!