ncsa / NEAT

NEAT (NExt-generation Analysis Toolkit) simulates next-gen sequencing reads and can learn simulation parameters from real data.

Error when generating variants where there is a degenerate symbol in the reference #122

Open dani-ture opened 2 months ago

dani-ture commented 2 months ago

Describe the bug

It looks like NEAT happened to hit a "Y" in the reference sequence while generating variants and aborted the variant generation process.

To Reproduce

Steps to reproduce the behavior:

  1. Download the latest human reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/

  2. Make a copy of the provided template config file (I called it test_config_human.yml) and set the parameters:

    reference:
    target_bed:
    produce_vcf: true
    produce_fastq: false
    rng_seed: 6386514007882411

    The remaining parameters are left at their default value, ".".

  3. Run neat on the command line:

    neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config_human.yml -o test

Expected behavior

Generate variants and output them to a vcf file.

Desktop:

image

joshfactorial commented 2 months ago

Yeah, I've never seen a "Y" in the reference before. I can investigate how to handle that. For now I would just do something like sed -i 's/Y/N/g' genome.fa to swap out Y's for N's and see if it runs okay.

dani-ture commented 2 months ago

I've been inspecting the ref files and apparently there are some ambiguous characters like K, Y, M, R, W... I guess I'll just have to preprocess them as you suggest. I don't know if I would have to reindex the human ref genome afterwards. Thanks!
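The preprocessing discussed above could be sketched in Python as follows. This is a minimal sketch, not part of NEAT: the function names `mask_line` and `mask_fasta` are hypothetical, and it assumes the only symbols that need masking are the IUPAC ambiguity codes R, Y, K, M, S, W, B, D, H and V. Unlike a blanket `sed 's/Y/N/g'`, it leaves header lines (those starting with `>`) untouched, so a sequence name that contains a "Y" is not rewritten.

```python
import re

# IUPAC ambiguity codes other than N (assumption: these are the symbols to mask).
AMBIGUOUS = re.compile(r"[RYKMSWBDHV]", re.IGNORECASE)


def mask_line(line: str) -> str:
    """Replace ambiguity codes with N (or n), leaving FASTA headers unchanged."""
    if line.startswith(">"):
        return line
    # Preserve case so soft-masked (lowercase) regions stay lowercase.
    return AMBIGUOUS.sub(lambda m: "N" if m.group().isupper() else "n", line)


def mask_fasta(src_path: str, dst_path: str) -> int:
    """Write a masked copy of src_path to dst_path; return the number of symbols replaced."""
    replaced = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            masked = mask_line(line)
            # Lines keep their length, so a position-wise comparison counts the changes.
            replaced += sum(a != b for a, b in zip(line, masked))
            dst.write(masked)
    return replaced
```

Because every symbol is replaced by exactly one symbol, line lengths and byte offsets stay the same, so an existing `.fai` index should in principle still describe the masked file. Rebuilding the index for the new file (for example with `samtools faidx`) is cheap if there is any doubt.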

joshfactorial commented 2 months ago

That sounds like you may have grabbed the protein reference, maybe? NEAT currently only works with DNA. Maybe the new revision of HG38 is doing something different.


From: Daniel Turégano. Sent: Thursday, July 11, 2024 12:37 PM. To: ncsa/NEAT. Subject: Re: [ncsa/NEAT] Error when generating variants where there is a degenerate symbol in the reference (Issue #122)

dani-ture commented 2 months ago

It is indeed the DNA reference, but there are a few of these degenerate bases scattered across it to indicate variation or uncertainty in the assembly.

You can read more here: https://en.wikipedia.org/wiki/Nucleic_acid_notation
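To see how many degenerate bases a reference actually contains, something like the following could be used. It is a rough sketch under stated assumptions: `count_ambiguous` is a hypothetical helper, not a NEAT function, and it treats anything other than A, C, G, T and N as a degenerate symbol.

```python
from collections import Counter

# Symbols treated as "standard" here (assumption: N is acceptable to the simulator).
STANDARD = set("ACGTN")


def count_ambiguous(lines):
    """Map each sequence name to a Counter of its non-ACGTN symbols."""
    counts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = Counter()
        elif name is not None:
            counts[name].update(ch for ch in line.upper() if ch not in STANDARD)
    return counts


# Example usage (the file name is a placeholder):
# with open("genome.fa") as fh:
#     for seq_name, symbols in count_ambiguous(fh).items():
#         if symbols:
#             print(seq_name, dict(symbols))
```

Reading the file line by line keeps memory use low, which matters for a reference the size of the human genome.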

joshfactorial commented 2 months ago

Okay, I guess I just haven't run into those in the wild yet.

joshfactorial commented 2 months ago

You might try HG19 or some older version of the reference.