ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Processing of off-target reference regions using target bed file #114

Open dani-ture opened 5 months ago

dani-ture commented 5 months ago

Describe the bug

It seems like neat processes regions in the reference that I did not explicitly include in the target bed file. I want to simulate reads for some cancer-related genes using the full human reference genome and a target bed file, but before using the complete bed file I wanted to test it with just 2 regions of chromosome 2. However, it seems like neat generates random mutations and samples reads for other chromosomes too. I don’t know if they will be written to the fastq file in the end, but these unnecessary steps make the process much slower. Additionally, neat also generated mutations for the whole of chromosome 2 and then filtered out the ones that landed in regions that were not included in the bed file.

To Reproduce

Steps to reproduce the behavior:

  1. Download the latest human reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

  2. Make a copy of the provided template config file (I called it test_config.yml) and set the parameters:

    ```yaml
    reference:
    target_bed:
    rng_seed: 6386514007882411
    ```

    The rest of the parameters are left at the “.” default. (A sketch of what a minimal target bed might contain follows these steps.)

  3. Run neat on the command line: `neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config.yml -o test`
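For context, a minimal two-region target bed for chromosome 2 might look like the lines below (tab-separated, with invented coordinates; note that the contig names have to match the FASTA headers of the reference, which for this NCBI assembly are RefSeq accessions such as NC_000002.12 rather than chr2):

```
NC_000002.12	10000	20000
NC_000002.12	50000	60000
```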

Expected behavior

Ideally, neat would process just the targeted regions to accelerate operations.

Additional comments

(screenshot attached: 20240614_error)


joshfactorial commented 5 months ago

Thanks for submitting this bug; I will take a look, hopefully this weekend.

joshfactorial commented 5 months ago

I think this ticket is fixed now. Please reopen if you have further issues.

dani-ture commented 4 months ago

Hi, thanks for the new release; all the other issues seem to have been fixed. However, I am still having problems with this one.

joshfactorial commented 4 months ago

Are you using a publicly available bed file? Or can you send me a sample of the bed file?

dani-ture commented 4 months ago

Attachment: cancer_exons_biomart_mini.txt

joshfactorial commented 4 months ago

Right, well, it still needs to generate reads, even if there are no mutations (unless you skip the fastq and bam files). The way it works is: it generates variants first, then generates reads, to which it adds the variants (and random sequencing errors). The sampling-reads and covering-dataset functions are part of generating the reads themselves. It does check the bed file when generating reads, and throws out any reads that fall outside of the bed regions. It does, however, generate variants for the entire dataset as a first pass; that part seemed to go fast enough that it wasn't critical to speed up, but we haven't tested on a full human dataset yet.

I can look into filtering to the bed regions. For a given read there may be differences from the reference (due to sequencing errors), but if you want to bypass read generation completely, you can try setting produce_fastq and produce_bam to false. I think the final vcf output will be filtered for the bam file (or it should be). I will also look into skipping areas that aren't in the bed file when generating the mutations, to see if that speeds things up.
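To make the "generate genome-wide, then throw out off-target reads" step concrete, here is a minimal sketch of that kind of post-hoc filter. This is not NEAT's actual code; the function and variable names are invented, and it assumes the bed intervals are sorted and non-overlapping:

```python
from bisect import bisect_right

def read_overlaps_targets(read_start, read_end, regions, starts):
    """Return True if the half-open read [read_start, read_end) overlaps
    any (start, end) bed interval. regions must be sorted and
    non-overlapping; starts is the precomputed list of region starts."""
    i = bisect_right(starts, read_start)
    # The region just before index i may extend past read_start...
    if i > 0 and regions[i - 1][1] > read_start:
        return True
    # ...and the region at index i may begin before the read ends.
    if i < len(regions) and regions[i][0] < read_end:
        return True
    return False

# Two target regions; candidate reads landing outside both get discarded.
regions = [(10000, 20000), (50000, 60000)]
starts = [s for s, _ in regions]
assert read_overlaps_targets(19950, 20100, regions, starts)      # kept
assert not read_overlaps_targets(30000, 30150, regions, starts)  # thrown out
```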

dani-ture commented 4 months ago

Thank you so much for explaining it in detail. As you mentioned, generating mutations does not take that long; the critical step is read generation. It would be ideal if neat generated reads directly in the bed regions instead of covering the full genome/chromosome and then discarding the reads that do not land in them. I will try what you suggest, but I really want the fastq file, to benchmark a variant calling pipeline and see whether it agrees with my input_vcf and neat's ground truth.
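For what it's worth, "generating reads directly in the bed regions" could look roughly like the sampler below. This is a sketch under the assumption of non-overlapping intervals and reads fully contained in a region, with invented names, not a description of NEAT's internals:

```python
import random

def sample_read_start(regions, read_len, rng=random):
    """Pick a uniformly random start position such that a read of length
    read_len fits entirely inside one of the (start, end) intervals,
    never touching off-target sequence."""
    # Number of valid start positions in each region.
    weights = [max(0, (end - start) - read_len + 1) for start, end in regions]
    total = sum(weights)
    if total == 0:
        raise ValueError("no region is long enough to hold a full read")
    pick = rng.randrange(total)
    for (start, _), w in zip(regions, weights):
        if pick < w:
            return start + pick
        pick -= w

regions = [(10000, 20000), (50000, 60000)]
start = sample_read_start(regions, read_len=150)
assert any(s <= start and start + 150 <= e for s, e in regions)
```

One caveat: real targeted data also contains reads that only partially overlap region edges, so padding each interval by roughly a read (or fragment) length before sampling would be closer to reality.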

joshfactorial commented 4 months ago

Right, the problem we keep running into is fragment sizes. For small fragments, coverage gets stuck trying to generate enough valid reads to cover the dataset, especially in paired-ended mode. The initial solution, then, was to generate end points for all the reads, then throw out the ones we don't need before doing any actual processing. That method may not work as well for longer chromosomes, because it requires looping through the length of the chromosome. Finding a balance, or a solution that works well for both long and short regions, is proving to be a challenge.
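To illustrate why that approach scales with contig length rather than with the size of the bed file (again invented names, just a sketch): enumerating candidate endpoints genome-wide does work proportional to the chromosome length, so on chr2 (~242 Mb) the loop runs hundreds of millions of iterations even when the targets cover only a few kilobases:

```python
def candidate_fragments(chrom_len, frag_len):
    """Enumerate every possible fragment on the chromosome, the way a
    'generate all end points, then discard the unneeded ones' pass must.
    The cost is O(chrom_len), independent of how small the bed file is."""
    for start in range(chrom_len - frag_len + 1):
        yield start, start + frag_len
```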