ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.
Other
49 stars 14 forks source link

Unable to complete the read-simulator example: IndexError, KeyError. #126

Open giobus75 opened 1 week ago

giobus75 commented 1 week ago

Describe the bug I'm trying to generate a simulated dataset by using some different references and a configuration file like the one described in the examples of the README, but they both fail with different errors.

The first error is IndexError: index out of range; the second error (using the same configuration file but with a different reference) is KeyError: 'R_C'

To Reproduce

Error 1:

  1. Using a hg19 reference (I don't know where it was downloaded from)

  2. Using this configuration file neat-config.yaml: reference: references/hg19/hg19.fa read_len: 126 produce_bam: False produce_vcf: True paired_ended: True fragment_mean: 300 fragment_st_dev: 30

3: Run the simulation with: neat read-simulator -c neat_config.yml -o simulated_stuff

  1. Got this error:
    ...
    2024-09-28 05:20:26,819:INFO:neat.read_simulator.utils.generate_variants:Added 58809 mutations to chr19                                                                                                      
    2024-09-28 05:20:26,819:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 06:22:27,161:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 60.04 m                                                                                                    
    2024-09-28 06:22:27,163:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 62.01 m                                                                                                     
    2024-09-28 06:22:39,143:INFO:neat.read_simulator.runner:Generating variants for chr22                                                                                                                        
    2024-09-28 06:26:53,358:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 4.23 minutes                                                                                
    2024-09-28 06:26:53,359:INFO:neat.read_simulator.utils.generate_variants:Added 50994 mutations to chr22                                                                                                      
    2024-09-28 06:26:53,360:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 08:36:39,767:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 128.10 m                                                                                                   
    2024-09-28 08:36:39,771:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 129.77 m                                                                                                    
    2024-09-28 08:37:11,003:INFO:neat.read_simulator.runner:Generating variants for chr21                                                                                                                        
    2024-09-28 08:41:05,380:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 3.90 minutes                                                                                
    2024-09-28 08:41:05,381:INFO:neat.read_simulator.utils.generate_variants:Added 48576 mutations to chr21                                                                                                      
    2024-09-28 08:41:05,381:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 10:29:21,504:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 106.70 m                                                                                                   
    2024-09-28 10:29:21,508:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 108.27 m                                                                                                    
    2024-09-28 10:29:46,976:INFO:neat.read_simulator.runner:Generating variants for chr6_ssto_hap7                                                                                                               
    2024-09-28 10:29:52,116:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 0.08 minutes                                                                                
    2024-09-28 10:29:52,116:INFO:neat.read_simulator.utils.generate_variants:Added 4840 mutations to chr6_ssto_hap7                                                                                              
    2024-09-28 10:29:52,117:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
    2024-09-28 10:30:48,740:ERROR:neat:read-simulator failed, see the traceback below                                                                                                                            
    Traceback (most recent call last):                                                                                                                                                                           
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner                                                                                 
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(                                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads                                                                          
    read_1.finalize_read_and_write(                                                                                                                                                                          
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write                                                                           
    self.errors, self.padding = err_model.get_sequencing_errors(                                                                                                                                             
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors                                                                                   
    snv_reference = reference_segment[index]                                                                                                                                                                 
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__                                                                                                              
    return chr(self._data[index])                                                                                                                                                                            
    IndexError: index out of range                                                                                                                                                                               
    ERROR: read-simulator failed, showing the last error                                                                                                                                                         
    Traceback (most recent call last):                                                                                                                                                                           
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner                                                                                 
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(                                                                                                         
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(                                                                   
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(                                                                                                                                             
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors                                                                                   
    snv_reference = reference_segment[index]                                                                                                                                                                 
    File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__                                                                                                              
    return chr(self._data[index])                                                                                                                                                                            
    IndexError: index out of range   

Error 2:

  1. Download the reference: wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

  2. Using this configuration file neat-config.yaml: reference: references/GRCh38_full_analysis_set_plus_decoy_hla.fa read_len: 126 produce_bam: False produce_vcf: True paired_ended: True fragment_mean: 300 fragment_st_dev: 30

  3. Run the simulation with: neat read-simulator -c neat_config.yml -o simulated_stuff

  4. Got this error:

...
2024-10-02 08:05:01,904:INFO:neat.read_simulator.utils.generate_variants:Added 181608 mutations to chr5                                                                                                      
2024-10-02 08:05:01,904:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
2024-10-02 10:38:55,523:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 147.51 m                                                                                                   
2024-10-02 10:38:55,525:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 153.89 m                                                                                                    
2024-10-02 10:39:21,967:INFO:neat.read_simulator.runner:Generating variants for chr6                                                                                                                         
2024-10-02 11:28:15,402:INFO:neat.read_simulator.utils.generate_variants:Finished generating random mutations in 48.86 minutes                                                                               
2024-10-02 11:28:15,406:INFO:neat.read_simulator.utils.generate_variants:Added 171458 mutations to chr6                                                                                                      
2024-10-02 11:28:15,406:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...                                                                                                                      
2024-10-02 13:42:35,951:INFO:neat.read_simulator.utils.generate_reads:Contig fastq(s) written in: 128.39 m                                                                                                   
2024-10-02 13:42:35,953:INFO:neat.read_simulator.utils.generate_reads:Finished sampling reads in 134.34 m                                                                                                    
2024-10-02 13:42:58,778:INFO:neat.read_simulator.runner:Generating variants for chr7                                                                                                                         
2024-10-02 14:12:46,750:ERROR:neat:read-simulator failed, see the traceback below                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                           
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 301, in read_simulator_runner                                                                                 
    local_variants = generate_variants(                                                                                                                                                                      
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_variants.py", line 210, in generate_variants                                                                    
    temp_variant = mutation_model.generate_snv(trinuc, location, options.rng)                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/mutation_model.py", line 109, in generate_snv                                                                                          
    transition_matrix = self.trinuc_trans_matrices[DINUC_IND[trinucleotide[0] + "_" + trinucleotide[2]]]                                                                                                     
KeyError: 'R_C'                                                                                                                                                                                              
ERROR: read-simulator failed, showing the last error                                                                                                                                                         
Traceback (most recent call last):                                                                                                                                                                           
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main                                                                                                                
    cmd(args)                                                                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute                                                                                          
    read_simulator_runner(arguments.config, arguments.output)                                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 301, in read_simulator_runner                                                                                 
    local_variants = generate_variants(                                                               
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_variants.py", line 210, in generate_variants                                                                    
    temp_variant = mutation_model.generate_snv(trinuc, location, options.rng)                                                                                                                                
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/mutation_model.py", line 109, in generate_snv                                                                                          
    transition_matrix = self.trinuc_trans_matrices[DINUC_IND[trinucleotide[0] + "_" + trinucleotide[2]]]                                                                                                     
KeyError: 'R_C'  

Expected behavior Have a vcf output file with simulated data

Desktop (please complete the following information):

Additional context I ran the neat read-simulator within a Docker container. I enter the container, activate the env neat by using conda and ran the simulation.

The Docker image was created by using this Dockerfile:

ARG CondaTag=24.3.0-0
ARG NeatTag=4.2.3

FROM condaforge/mambaforge:${CondaTag}

# Follow some of the tips from https://jcristharif.com/conda-docker-tips.html
ENV PYTHONDONTWRITEBYTECODE=true

# cache directory: we should be using ${CONDA_PREFIX} from within the
# container image, but I don't know how to get that value here,
# so we're hard-coding the path and assuming that CONDA_PREFIX has
# value /opt/conda/pkgs
ENV CONDA_PREFIX=/opt/conda
ARG conda_pkgs_dir=${CONDA_PREFIX}/pkgs

WORKDIR /tmp

RUN \
  --mount=type=cache,target=${conda_pkgs_dir},sharing=locked \
  --mount=type=cache,target=/tmp \
     ([ -d /tmp/NEAT-source ] || git clone --filter=tree:0 https://github.com/ncsa/NEAT.git /tmp/NEAT-source) \
  && cd /tmp/NEAT-source \
  && git checkout ${NeatTag} \
  && mamba env create --file environment.yml --name neat --no-default-packages \
  && conda run --live-stream --name neat poetry build \
  && conda run --live-stream --name neat pip install --no-cache-dir dist/neat*whl

RUN adduser neat
USER neat
WORKDIR /home/neat
joshfactorial commented 1 week ago

Currently working on this. Will post a fix/new version soon!

joshfactorial commented 1 week ago

One thing I'm noticing is that it's stumbling on the newest genome assembly, because of the inclusion of characters other than A, C, G, T, N. We will have to update the code to generalize these alternate characters, but currently we're unclear the best way to handle those is. It might be worth trying to replace non ACTG with N and see if that resolves part of the problem. I think we still have a stray indexing error, though, that crops up sometimes.

giobus75 commented 6 days ago

Thank you for your fast response. I'll try to follow your workaround replacing ACTG with N.

giobus75 commented 1 day ago

Hi, I replaced not-ACTGN chars with N but it still returns an Index out of range error.

I used this code to replace chars:

fn = "../references/GRCh38_mod_with_N.fa"
out_fn = "../references/GRCh38_mod_with_N_replaced.fa"

with open (fn) as fd:
    buff = fd.readlines()

new_buff = []
for i, l in enumerate(buff):
    if "chr" not in l and "HLA" not in l:
        l_upper = l.upper()
        l = l_upper.translate(str.maketrans({'a': 'A', 'g': 'g', 'c': 'C', 't': 'T', 'M': 'N', 'R': 'N', 'Y': 'N', 'W': 'N', 'B': 'N', 'S': 'N', 'K': 'N'}))

    new_buff.append(l)

with open(out_fn, "w") as fd:
    for i, l in enumerate(new_buff):
        fd.write(l)

Then I ran the simulation with the modified reference file:

neat read-simulator -c neat_config.yml -o simulated_stuff

And I got:

 2024-10-08 15:08:23,709:INFO:neat.read_simulator.utils.generate_variants:Added 144796 mutations to chr8
2024-10-08 15:08:23,709:INFO:neat.read_simulator.utils.generate_reads:Sampling reads...
2024-10-08 16:13:45,016:ERROR:neat:read-simulator failed, see the traceback below
Traceback (most recent call last):
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors
    snv_reference = reference_segment[index]
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range
ERROR: read-simulator failed, showing the last error
Traceback (most recent call last):
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/cli.py", line 131, in main
    cmd(args)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/cli/commands/read_simulator.py", line 47, in execute
    read_simulator_runner(arguments.config, arguments.output)
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/runner.py", line 314, in read_simulator_runner
    read1_fastq_paired, read1_fastq_single, read2_fastq_paired, read2_fastq_single = generate_reads(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/generate_reads.py", line 345, in generate_reads
    read_1.finalize_read_and_write(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/read_simulator/utils/read.py", line 334, in finalize_read_and_write
    self.errors, self.padding = err_model.get_sequencing_errors(
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/neat/models/error_models.py", line 241, in get_sequencing_errors
    snv_reference = reference_segment[index]
  File "/opt/conda/envs/neat/lib/python3.10/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
IndexError: index out of range
joshfactorial commented 1 day ago

All right. I will look into this!