ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
51 stars 14 forks source link

Unable to complete the read-simulator example: IndexError, KeyError. #126

Open giobus75 opened 1 month ago

giobus75 commented 1 month ago

Describe the bug I'm trying to generate a simulated dataset by using some different references and a configuration file like the one described in the examples of the README, but they both fail with different errors.

The first error is IndexError: index out of range; the second error (using the same configuration file but with a different reference) is KeyError: 'R_C'

To Reproduce

Error 1:

  1. Using a hg19 reference (I don't know where it was downloaded from)

  2. Using this configuration file neat-config.yaml: reference: references/hg19/hg19.fa read_len: 126 produce_bam: False produce_vcf: True paired_ended: True fragment_mean: 300 fragment_st_dev: 30

3: Run the simulation with: neat read-simulator -c neat_config.yml -o simulated_stuff

  1. Got this error:
    ...
    2024-09-28 05:20:26,819:INFO:neat.read_simulator.utils.generate_variants:Added 58809 mutations to chr19                                                                                                      
    2024-09-28 05:20:26,819:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 06:22:27,161:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 60.04 m                                                                                                    
    2024-09-28 06:22:27,163:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 62.01 m                                                                                                     
    2024-09-28 06:22:39,143:INFO:neat.read_simulator.runner:Generating variants for chr22                                                                                                                        
    2024-09-28 06:26:53,358:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 4.23 minutes                                                                                
    2024-09-28 06:26:53,359:INFO:neat.read_simulator.utils.generate_variants:Added 50994 mutations to chr22                                                                                                      
    2024-09-28 06:26:53,360:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 08:36:39,767:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 128.10 m                                                                                                   
    2024-09-28 08:36:39,771:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 129.77 m                                                                                                    
    2024-09-28 08:37:11,003:INFO:neat.read_simulator.runner:Generating variants for chr21                                                                                                                        
    2024-09-28 08:41:05,380:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 3.90 minutes                                                                                
    2024-09-28 08:41:05,381:INFO:neat.read_simulator.utils.generate_variants:Added 48576 mutations to chr21                                                                                                      
    2024-09-28 08:41:05,381:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 10:29:21,504:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 106.70 m                                                                                                   
    2024-09-28 10:29:21,508:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 108.27 m                                                                                                    
    2024-09-28 10:29:46,976:INFO:neat.read_simulator.runner:Generating variants for chr6_ssto_hap7                                                                                                               
    2024-09-28 10:29:52,116:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 0.08 minutes                                                                                
    2024-09-28 10:29:52,116:INFO:neat.read_simulator.utils.generate_variants:Added 4840 mutations to chr6_ssto_hap7                                                                                              
    2024-09-28 10:29:52,117:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 10:30:48,740:ERROR:neat:read-simulator failed, see the traceback below                                                                                                                            
    Traceback (most recent call last):                                                                                                                                                                           
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner                                                                                 
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(                                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads                                                                          
    read_1.finalize_read_and_write(                                                                                                                                                                          
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write                                                                           
    self.errors, self.padding = err_model.get_sequencing_errors(                                                                                                                                             
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors                                                                                   
    snv_reference = reference_segment[index]                                                                                                                                                                 
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__                                                                                                              
    return chr(self._data[index])                                                                                                                                                                            
    IndexError: index out of range                                                                                                                                                                               
    ERROR: read-simulator failed, showing the last error                                                                                                                                                         
    Traceback (most recent call last):                                                                                                                                                                           
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner                                                                                 
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(                                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(                                                                   
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(                                                                                                                                             
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors                                                                                   
    snv_reference = reference_segment[index]                                                                                                                                                                 
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__                                                                                                              
    return chr(self._data[index])                                                                                                                                                                            
    IndexError: index out of range   

Error 2:

  1. Download the reference: wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

  2. Using this configuration file neat-config.yaml: reference: references/GRCh38_full_analysis_set_plus_decoy_hla.fa read_len: 126 produce_bam: False produce_vcf: True paired_ended: True fragment_mean: 300 fragment_st_dev: 30

  3. Run the simulation with: neat read-simulator -c neat_config.yml -o simulated_stuff

  4. Got this error:

...
2024-10-02 08:05:01,904:INFO:neat.read_simulator.utils.generate_variants:Added 181608 mutations to chr5                                                                                                      
2024-10-02 08:05:01,904:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
2024-10-02 10:38:55,523:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 147.51 m                                                                                                   
2024-10-02 10:38:55,525:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 153.89 m                                                                                                    
2024-10-02 10:39:21,967:INFO:neat.read_simulator.runner:Generating variants for chr6                                                                                                                         
2024-10-02 11:28:15,402:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 48.86 minutes                                                                               
2024-10-02 11:28:15,406:INFO:neat.read_simulator.utils.generate_variants:Added 171458 mutations to chr6                                                                                                      
2024-10-02 11:28:15,406:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
2024-10-02 13:42:35,951:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 128.39 m                                                                                                   
2024-10-02 13:42:35,953:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 134.34 m                                                                                                    
2024-10-02 13:42:58,778:INFO:neat.read_simulator.runner:Generating variants for chr7                                                                                                                         
2024-10-02 14:12:46,750:ERROR:neat:read-simulator failed, see the traceback below                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                           
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 301, in read_simulator_runner                                                                                 
    local_variants = generate_variants(                                                                                                                                                                      
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_variants.py", line 210, in generate_variants                                                                    
    temp_variant = mutation_model.generate_snv(trinuc, location, options.rng)                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/mutation_model.py", line 109, in generate_snv                                                                                          
    transition_matrix = self.trinuc_trans_matrices[DINUC_IND[trinucleotide[0] + "_" + trinucleotide[2]]]                                                                                                     
KeyError: 'R_C'                                                                                                                                                                                              
ERROR: read-simulator failed, showing the last error                                                                                                                                                         
Traceback (most recent call last):                                                                                                                                                                           
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 301, in read_simulator_runner                                                                                 
    local_variants = generate_variants(                                                               
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_variants.py", line 210, in generate_variants                                                                    
    temp_variant = mutation_model.generate_snv(trinuc, location, options.rng)                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/mutation_model.py", line 109, in generate_snv                                                                                          
    transition_matrix = self.trinuc_trans_matrices[DINUC_IND[trinucleotide[0] + "_" + trinucleotide[2]]]                                                                                                     
KeyError: 'R_C'  

Expected behavior Have a vcf output file with simulated data

Desktop (please complete the following information):

Additional context I ran the neat read-simulator within a Docker container. I enter the container, activate the env neat by using conda and ran the simulation.

The Docker image was created by using this Dockerfile:

ARG CondaTag=24.3.0-0
ARG NeatTag=4.2.3

FROM condaforge/mambaforge:${CondaTag}

# Follow some of the tips from https://jcristharif.com/conda-docker-tips.html
ENV PYTHONDONTWRITEBYTECODE=true

# cache directory: we should be using ${CONDA_PREFIX} from within the
# container image, but I don't know how to get that value here,
# so we're hard-coding the path and assuming that CONDA_PREFIX has
# value /opt/conda/pkgs
ENV CONDA_PREFIX=/opt/conda
ARG conda_pkgs_dir=${CONDA_PREFIX}/pkgs

WORKDIR /tmp

RUN \
  --mount=type=cache,target=${conda_pkgs_dir},sharing=locked \
  --mount=type=cache,target=/tmp \
     ([ -d /tmp/NEAT-source ] || git clone --filter=tree:0 https://github.com/ncsa/NEAT.git /tmp/NEAT-source) \
  && cd /tmp/NEAT-source \
  && git checkout ${NeatTag} \
  && mamba env create --file environment.yml --name neat --no-default-packages \
  && conda run --live-stream --name neat poetry build \
  && conda run --live-stream --name neat pip install --no-cache-dir dist/neat*whl

RUN adduser neat
USER neat
WORKDIR /home/neat
joshfactorial commented 1 month ago

Currently working on this. Will post a fix/new version soon!

joshfactorial commented 1 month ago

One thing I'm noticing is that it's stumbling on the newest genome assembly, because of the inclusion of characters other than A, C, G, T, N. We will have to update the code to generalize these alternate characters, but currently we're unclear the best way to handle those is. It might be worth trying to replace non ACTG with N and see if that resolves part of the problem. I think we still have a stray indexing error, though, that crops up sometimes.

giobus75 commented 1 month ago

Thank you for your fast response. I'll try to follow your workaround replacing ACTG with N.

giobus75 commented 1 month ago

Hi, I replaced not-ACTGN chars with N but it still returns an Index out of range error.

I used this code to replace chars:

fn = "../references/GRCh38_mod_with_N.fa"
out_fn = "../references/GRCh38_mod_with_N_replaced.fa"

with open (fn) as fd:
    buff = fd.readlines()

new_buff = []
for i, l in enumerate(buff):
    if "chr" not in l and "HLA" not in l:
        l_upper = l.upper()
        l = l_upper.translate(str.maketrans({'a': 'A', 'g': 'g', 'c': 'C', 't': 'T', 'M': 'N', 'R': 'N', 'Y': 'N', 'W': 'N', 'B': 'N', 'S': 'N', 'K': 'N'}))

    new_buff.append(l)

with open(out_fn, "w") as fd:
    for i, l in enumerate(new_buff):
        fd.write(l)

Then I ran the simulation with the modified reference file:

neat read-simulator -c neat_config.yml -o simulated_stuff

And I got:

 2024-10-08 15:08:23,709:INFO:neat.read_simulator.utils.generate_variants:Added 144796 mutations to chr8
2024-10-08 15:08:23,709:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...
2024-10-08 16:13:45,016:ERROR:neat:read-simulator failed, see the traceback below
Traceback (most recent call last):
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors
    snv_reference = reference_segment[index]
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range
ERROR: read-simulator failed, showing the last error
Traceback (most recent call last):
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors
    snv_reference = reference_segment[index]
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range
joshfactorial commented 1 month ago

All right. I will look into this!

joshfactorial commented 1 month ago

Still working on this. It turned out to have a few subtlties that are tricky. We will try to post a fix this week.

giobus75 commented 1 month ago

Thanks so much! I really appreciate it!

joshfactorial commented 1 month ago

All right, I'm not yet able to reproduce this error with the latest version. It's possible it was related to another bug I fixed in the sequencing error section, since that is the part throwing the error here, so maybe it is fixed, if you want to try the latest version and let us know the results.

joshfactorial commented 1 month ago

My test was to run specifically on chr8 from that same file. It was taking too long to run the whole genome, for me. So it's possible there's some other issue still.

giobus75 commented 1 month ago

Hi,

I ran another read simulation using the v4.2.6 tag, but unfortunately, I'm still encountering an "Index out of range" error. I've attached the log from the run in case it helps diagnose the issue.

The error occurs after more than two days of simulation, which makes troubleshooting quite time-consuming. Is there a faster way to check the code or perhaps an option that I might have overlooked?

1729519334.699251_NEAT.log

joshfactorial commented 1 month ago

Okay, We did have a small bug that we fixed in 4.2.7 but it was related to quality scores. this is failing before that step. I will try again to replicate this.

giobus75 commented 3 weeks ago

Hi, I tried the 4.2.7 and, running the same simulation, the error occured earlier (less than 4 hours).

The log: 1730662388.9120936_NEAT.log

joshfactorial commented 3 weeks ago

All right, I will dive into that this week, hopefully.

SnehaGummadi commented 1 week ago

Hi, I tried the 4.2.7 and, running the same simulation, the error occured earlier (less than 4 hours).

The log: 1730662388.9120936_NEAT.log

I too am having this same error with the 4.2.7 version:

2024-11-16 11:56:14,001:INFO:neat.common.logging:writing log to: {base_dir}/NEAT/1731779766.7208364_NEAT.log
2024-11-16 11:56:14,011:INFO:neat.read_simulator.runner:Using configuration file paired75.yml
2024-11-16 11:56:14,012:INFO:neat.read_simulator.runner:Saving output files to .
2024-11-16 11:56:14,015:INFO:neat.read_simulator.utils.options:Run Configuration...
2024-11-16 11:56:14,015:INFO:neat.read_simulator.utils.options:Input fasta: {base_dir}/NEAT-chimeric/reference_files/chr18_smallest.fa
2024-11-16 11:56:14,015:INFO:neat.read_simulator.utils.options:Producing the following files:
    - {base_dir}/NEAT/paired_75_x10_r1.fastq.gz
    - {base_dir}/NEAT/paired_75_x10_r2.fastq.gz

2024-11-16 11:56:14,015:INFO:neat.read_simulator.utils.options:Single threading - 1 thread.
2024-11-16 11:56:14,015:INFO:neat.read_simulator.utils.options:Using a read length of 75
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Generating fragments based on mean=300, stand. dev=100
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Running in paired-ended mode.
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Average coverage: 10
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Using default error model.
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:User defined average sequencing error rate: 0.001.
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Ploidy value: 1
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:Custom average mutation rate for the run: 0.01
2024-11-16 11:56:14,016:INFO:neat.read_simulator.utils.options:RNG seed value for run: 4617953855961099
2024-11-16 11:56:14,016:INFO:neat.read_simulator.runner:Reading Models...
2024-11-16 11:56:14,017:INFO:neat.read_simulator.runner:Reading {base_dir}/NEAT-chimeric/reference_files/chr18_smallest.fa.
2024-11-16 11:56:14,994:INFO:neat.read_simulator.runner:Beginning simulation.
2024-11-16 11:56:15,226:INFO:neat.read_simulator.runner:Generating variants for chr18
2024-11-16 12:53:06,964:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 56.86 minutes
2024-11-16 12:53:06,991:INFO:neat.read_simulator.utils.generate_variants:Added 113797 mutations to chr18
2024-11-16 12:53:06,991:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...
2024-11-16 12:56:24,853:ERROR:neat:read-simulator failed, see the traceback below
Traceback (most recent call last):
  File "{base_dir}/NEAT/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "{base_dir}/NEAT/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "{base_dir}/NEAT/neat/read_simulator/runner.py", line 313, in read_simulator_runner
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(
  File "{base_dir}/NEAT/neat/read_simulator/utils/generate_reads.py", line 383, in generate_reads
    read_2.finalize_read_and_write(
  File "{base_dir}/NEAT/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(
  File "{base_dir}/NEAT/neat/models/error_models.py", line 193, in get_sequencing_errors
    if rng.random() < quality_score_error_rate[quality_scores[i]]:
IndexError: index 90 is out of bounds for axis 0 with size 90
joshfactorial commented 1 week ago

Sorry to hear. I will take a look as soon as I can.