Open dani-ture opened 2 months ago
Yeah, I've never seen a "Y" in the reference before. I can investigate how to handle that. For now I would just do something like sed -i '/^>/! s/Y/N/g' genome.fa
to swap out Y's for N's in the sequence lines (leaving header lines such as the chromosome Y record name untouched) and see if it runs okay.
I've been inspecting the ref files and apparently there are some ambiguous characters like K, Y, M, R, W... I guess I'll just have to preprocess them as you suggest. I don't know if I would have to reindex the human ref genome afterwards. Thanks!
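A minimal sketch of that preprocessing step, assuming a plain-text (uncompressed) FASTA file. The names `mask_ambiguous` and `mask_fasta` are hypothetical helpers, not part of NEAT. It replaces every IUPAC ambiguity code other than N with N, preserves case, and leaves header lines alone:

```python
import re

# IUPAC ambiguity codes other than N, in both cases.
AMBIGUOUS = re.compile(r"[RYKMSWBDHVrykmswbdhv]")


def mask_ambiguous(lines):
    """Yield FASTA lines with ambiguity codes replaced by N (or n)."""
    for line in lines:
        if line.startswith(">"):
            # Header line: record names may legitimately contain Y, K, etc.
            yield line
        else:
            # Sequence line: swap each ambiguity code for N, keeping its case.
            yield AMBIGUOUS.sub(
                lambda m: "N" if m.group().isupper() else "n", line
            )


def mask_fasta(src, dst):
    """Write a masked copy of the FASTA file at src to dst."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.writelines(mask_ambiguous(fin))
```

Calling `mask_fasta("genome.fa", "genome.masked.fa")` would write a cleaned copy next to the original. On reindexing: the substitution is one-for-one, so sequence lengths and byte offsets do not change and a samtools-style `.fai` index should still line up. Indexes that encode the sequence content itself (for example aligner indexes) would likely need to be rebuilt.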
That sounds like you may have grabbed the protein reference, maybe? NEAT currently only works with DNA. Maybe the new revision of HG38 is doing something different.
From: Daniel Turégano. Sent: Thursday, July 11, 2024 12:37 PM. To: ncsa/NEAT. Cc: Allen, Josh; Comment. Subject: Re: [ncsa/NEAT] Error when generating variants where there is a degenerate symbol in the reference (Issue #122)
It is indeed the DNA reference, but a few of these degenerate bases are scattered across it to indicate variation or uncertainty in the assembly.
You can read more here: https://en.wikipedia.org/wiki/Nucleic_acid_notation
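To check how many degenerate bases a given reference actually contains, a small tally along these lines may help. This is only a sketch: `count_nonstandard` is a hypothetical helper, and it assumes an uncompressed FASTA read line by line.

```python
from collections import Counter

# Characters treated as ordinary here: the four bases plus N, in both cases.
STANDARD = set("ACGTNacgtn")


def count_nonstandard(lines):
    """Count every non-ACGTN symbol in the sequence lines of a FASTA."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines
        # Tally symbols case-insensitively by upper-casing them.
        counts.update(ch.upper() for ch in line.strip() if ch not in STANDARD)
    return counts
```

For example, `with open("genome.fa") as fh: print(count_nonstandard(fh))` would print a per-symbol tally, which shows whether the file holds only a handful of ambiguity codes or something unexpected such as protein sequence.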
Okay, I guess I just haven't run into those in the wild yet.
You might try HG19 or some older version of the reference.
Describe the bug
It looks like NEAT happened to hit a "Y" in the reference sequence while generating variants and aborted the variant generation process.
To Reproduce
Steps to reproduce the behavior:
Download the latest human reference genome: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001405.40/
Make a copy of the provided template config file (I called it test_config_human.yml) and set the parameters:
```
reference:
target_bed:
produce_vcf: true
produce_fastq: false
rng_seed: 6386514007882411
```
The rest are left at the default ".".
Run neat on the command line:
neat --log-name test --log-detail HIGH --log-level DEBUG read-simulator -c test_config_human.yml -o test
Expected behavior
Generate variants and output them to a vcf file.
Desktop: