Problems with Cigar Strings in adding loci to db

plendere commented 10 months ago

Hello!

Here is the issue I am facing when trying to add a .fa file for a specific locus to db (locityper add).

[12:07:40 ESC[36mDEBUGESC[0m] locityper add -d HG03516/db/ -s HG03516/HG03516_MUC5B_LoO.fa=MUC5B [12:07:40 ESC[36mDEBUGESC[0m] locityper v0.5.3-3 @ 2023-08-25 12:07:40 [12:07:40 ESC[33m WARNESC[0m] [MUC5B] There are variants on the boundary of the locus [12:07:40 ESC[32m INFOESC[0m] Writing 206 haplotypes to HG03516/db/loci/MUC5B/ [12:07:40 ESC[32m INFOESC[0m] Calculating haplotype divergence [12:07:40 ESC[36mDEBUGESC[0m] Aligning sequences in 8 threads thread '' panicked at 'Unexpected CIGAR operation '^@'', src/seq/wfa.rs:131:14 note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

Perhaps there is an issue with WFA2-lib contributing to this error?

projectoriented commented 10 months ago

Hi, thank you for developing this!

I helped @plendere compile locityper and git cloned the latest WFA2-lib and locityper at commit hash #e0dd049. I'm seeing you've made further commits to the repo, so I can compile it for the latest version & let you know if we still experience this error?

tprodanov commented 10 months ago

Hi @projectoriented @plendere Yes, for now I cannot reproduce the error, so I added more information to the error message. Can you please recompile at hash #ad223eb? Then, hopefully, the program will not panic, but will output error to stderr, can you please send it to me?

projectoriented commented 10 months ago

Yup! I recompiled it at that hash. This is our latest error- please let us know if you have any insights:

[15:27:26 ESC[36mDEBUGESC[0m] locityper preproc -i data/Illumina/fastq/ILLUMINA_HG03540_1.fastq.gzdata/Illu
mina/fastq/ILLUMINA_HG03540_2.fastq.gz -d HG03540/db/ -r chm13_v1.1_plus38Y_masked.fasta --threads 4 -o HG03540/analysis
[15:27:26 ESC[36mDEBUGESC[0m] locityper v0.5.3-8 @ 2023-08-29 15:27:26
[15:27:26 ESC[32m INFOESC[0m] Loading background non-duplicated region into memory
[15:27:27 ESC[32m INFOESC[0m] Calculating mean read length
[15:27:27 ESC[32m INFOESC[0m] Mean read length = 150.0
[15:27:27 ESC[32m INFOESC[0m] Mapping reads to the reference
[15:27:27 ESC[36mDEBUGESC[0m] Using random seed 17984249459300061562
[15:27:27 ESC[36mDEBUGESC[0m]     _subsample_ --rate 0.1 -i data/Illumina/fastq/ILLUMINA_HG03540_1.fastq.gz data/Illumina/fastq/ILLUMINA_HG03540_2.fastq.gz | /opt/strobealign/build/strobealign -N 0 -R 0 -U -f 0.001 --eqx -t 4 -r 150 --interleaved chm13_v1.1_plus38Y_masked.fasta - | /opt/samtools/samtools-1.18/samtools view -b -F 3852 -q 20 -L HG03540/db/bg/bg.bed -o HG03540/analysis/bg/aln.tmp.bam
[15:27:27 ESC[36mDEBUGESC[0m]
[15:27:27 ESC[36mDEBUGESC[0m]     Finished in 0:00:00.101
[15:27:27 ESC[31mERRORESC[0m] Successfully killed child process
[15:27:27 ESC[31mERRORESC[0m] Finished with an error:
Error while running subprocess, exit status: 1
Stdout: [main_samview] fail to read the header from "-".

tprodanov commented 10 months ago

Once again, I am not entirely sure what is happening here, but I modified the code to provide more information on error. What strobealign version do you use? Overall, please recompile with the latest commit, and send me new error message, if there will be any.

Also, were there any errors in locityper add?

projectoriented commented 10 months ago

Hi @tprodanov , RE: my previous message: I'm starting to think this may be an issue with Snakemake not being able to handle the return code from child processes because when I ran the command independent of the workflow manager- it worked.

Peep here:

[16:24:07 DEBUG] /root/.cargo/bin/locityper preproc -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2.fastq.gz -d GM19129/db -r chm13_v1.1_plus38Y_masked.fasta --threads 4 -o out-GM19129/analysis
[16:24:07 DEBUG] locityper v0.5.3-8 @ 2023-08-29 16:24:07
[16:24:07  INFO] Loading background non-duplicated region into memory
[16:24:08  INFO] Calculating mean read length
[16:24:08  INFO] Mean read length = 150.0
[16:24:08  INFO] Mapping reads to the reference
[16:24:08 DEBUG] Using random seed 7174794796044194426
[16:24:08 DEBUG]     _subsample_ --rate 0.1 -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2.fastq.gz | /opt/strobealign/build/strobealign -N 0 -R 0 -U -f 0.001 --eqx -t 4 -r 150 --interleaved chm13_v1.1_plus38Y_masked.fasta - | /opt/samtools/samtools-1.18/samtools view -b -F 3852 -q 20 -L GM19129/db/bg/bg.bed -o out-GM19129/analysis/bg/aln.tmp.bam
This is strobealign 0.11.0-72-gf6b6208
Time reading reference: 6.69 s
Reference size: 3112.23 Mbp (26 contigs; largest: 248.39 Mbp)
Indexing ...
  Time counting seeds: 21.31 s
  Time generating seeds: 40.42 s
  Time sorting seeds: 44.46 s
  Time generating hash table index: 7.60 s
Total time indexing: 113.80 s
Running in paired-end mode
 Mapped        79.01 M reads @    26.50 us/read
Done!
Total mapping sites tried: 136058842
Total calls to ssw: 36809573
Inconsistent NAM ends: 701376
Tried NAM rescue: 0
Mates rescued by alignment: 23415606
Total time mapping: 2093.61 s.
Total time reading read-file(s): 327.77 s.
Total time creating strobemers: 134.23 s.
Total time finding NAMs (non-rescue mode): 365.46 s.
Total time finding NAMs (rescue mode): 0.00 s.
Total time sorting NAMs (candidate sites): 13.72 s.
Total time base level alignment (ssw): 1211.67 s.
Total time writing alignment to files: 0.00 s.
[17:01:04 DEBUG]
[17:01:04 DEBUG]     Finished in 0:36:55.940
[17:01:04 DEBUG] Loading mapped reads into memory (out-GM19129/analysis/bg/aln.bam)
[17:01:04 DEBUG]     Loaded 120943 alignments, discarded 4888
[17:01:04  INFO] Estimating insert size distribution from 58824 read pairs
[17:01:04  INFO]     FR/RF:    58824 (1.000%),  allowed
[17:01:04  INFO]     FF/RR:        0 (0.000%),  forbidden
[17:01:04  INFO]     Insert size mean = 430.9,  st.dev. = 101.1
[17:01:04  INFO]     Allowed insert size: [171, 841]  (99.9%-confidence interval)
[17:01:04 DEBUG]     Using 100 bp windows, (window neighbourhood size 300, boundary 1000)
[17:01:04 DEBUG]     Total windows:     44980
[17:01:04 DEBUG]     Remove 0 windows with Ns, 17803 windows with common k-mers
[17:01:04 DEBUG]     After filtering:   27177
[17:01:04  INFO] Estimating read error profiles from 117404 reads
[17:01:04  INFO]         11120710 matches    (0.997493)
[17:01:04  INFO]            24636 mismatches (0.002210)
[17:01:04  INFO]             1601 insertions (0.000144)
[17:01:04  INFO]             1708 deletions  (0.000153)
[17:01:04  INFO]     Maximum allowed edit distance: 3 (for read length 150, 99%-confidence interval)
[17:01:04 DEBUG]     Few windows (< 100) with GC-content < 28 or > 77, bluring distributions
[17:01:04  INFO]     Read depth mean = 25.09,  variance: 34.94  (at GC-content 40)
[17:01:04  INFO] Success. Total time: 0:36:56.711

As for locityper add, @plendere may have a better answer for that. Thank you for your responses! 🚀

plendere commented 10 months ago

@tprodanov I think I solved the locityper add issue with cigar strings, I think it was something wrong with my input fasta files. however, I am having the same issues with locityper preproc as Mei in Snakemake. Do you have any thoughts as to why this may be happening?

tprodanov commented 10 months ago

@plendere What version did you use? At some point I added more information to error messages, if you pull updates and recompile, maybe the error can help me figure out what is wrong.

Also, for some reason preprocessing in the log file above takes much longer than in my case, not entirely sure why.

projectoriented commented 10 months ago

@tprodanov , I recompiled and built the container based on version 0.6.5-6. Using Snakemake to manage the workflow-I'm discovering that when I run this locally without distributing it to various compute nodes- it runs successfully. However, when I distribute it for hundreds of samples, the below are error messages we get for some whereas others succeed. I have a hunch it may be the environment of the specific compute node the job is sent to or it may be the input files.

What's mind boggling is, samtools executable can be found but not strobealign. And this is the path variable inside the container:

$ echo $PATH
/opt/htslib/htslib-1.18:/opt/samtools/samtools-1.18:/opt/minimap2/minimap2-2.26_x64-linux:/opt/strobealign/build:/root/.cargo/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

Activating singularity image containers/locityper/0.6.5-6/locityper_0.6.5-6.sif
[12:51:57 ESC[36mDEBUGESC[0m] locityper preproc -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2.fastq.gz -d results/GM19129/db/ -@ 8 --technology illumina -r chm13_v1.1_plus38Y_masked.fasta --debug -o results/GM19129/analysis/
[12:51:57 ESC[36mDEBUGESC[0m] locityper v0.6.5-6 @ 2023-09-08 12:51:57
[12:51:57 ESC[32m INFOESC[0m] Recognized CHM13 reference genome, using background region chr17:72950001-77450000
[12:51:58 ESC[32m INFOESC[0m] Calculating k-mer counts on the background region
[12:52:05 ESC[32m INFOESC[0m] Calculating mean read length
[12:52:05 ESC[32m INFOESC[0m] Mean read length = 150.0
[12:52:05 ESC[32m INFOESC[0m] Mapping reads to the reference
[12:52:05 ESC[36mDEBUGESC[0m] Using random seed 2020486896464004164
[12:52:05 ESC[36mDEBUGESC[0m]     _subsample_ --rate 0.1 -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2.fastq.gz | /opt/strobealign/build/strobealign -N 0 -R 0 -U -f 0.001 --eqx -t 8 -r 150 --no-progress --interleaved chm13_v1.1_plus38Y_masked.fasta - | /opt/samtools/samtools-1.18/samtools view -b -F 3852 -q 20 -L results/GM19129/analysis/bg/tmp.bed -o results/GM19129/analysis/bg/tmp.bam
[12:52:05 ESC[31mERRORESC[0m] Finished with an error:
Subprocess error:
strobealign version unavailable
    Exit code: unknown
samtools 1.18
    Exit code: 1
    Stderr: [main_samview] fail to read the header from "-".

If you're able to help here, that would be wonderful but in the meantime I will contact our IT department. Below is the successful log:

Activating singularity image Activating singularity image containers/locityper/0.6.5-6/locityper_0.6.5-6.sif
[12:53:21 DEBUG] locityper preproc -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2
.fastq.gz -d results/GM19129/db/ -@ 8 --technology illumina -r chm13_v1.1_plus38Y_masked.fasta --debug -o results/GM19129/analysis/
[12:53:21 DEBUG] locityper v0.6.5-6 @ 2023-09-08 12:53:21
[12:53:21  INFO] Recognized CHM13 reference genome, using background region chr17:72950001-77450000
[12:53:21  INFO] Calculating k-mer counts on the background region
[12:53:33  INFO] Calculating mean read length
[12:53:33  INFO] Mean read length = 150.0
[12:53:33  INFO] Mapping reads to the reference
[12:53:33 DEBUG] Using random seed 10458837848804174845
[12:53:33 DEBUG]     _subsample_ --rate 0.1 -i data/Illumina/fastq/ILLUMINA_NA19129_1.fastq.gz data/Illumina/fastq/ILLUMINA_NA19129_2.fastq.gz | /opt/strobealign/build/strobealign -N 0 -R 0 -U -f 0.001 --eqx -t 8 -r 150 --no-progress --interleaved chm13_v1.1_plus38Y_masked.fasta - | /opt/samtools/samtoo
ls-1.18/samtools view -b -F 3852 -q 20 -L results/GM19129/analysis/bg/tmp.bed -o results/GM19129/analysis/bg/tmp.bam
...
[13:15:10 DEBUG]     Finished in 0:21:37.562
[13:15:10 DEBUG] Loading mapped reads into memory (results/GM19129/analysis/bg/aln.bam)
[13:15:11 DEBUG]     Loaded 120668 alignments, discarded 4899
[13:15:11  INFO] Estimating insert size distribution from 58700 read pairs
[13:15:11  INFO]     FR/RF:    58698 (1.000%),  allowed
[13:15:11  INFO]     FF/RR:        2 (0.000%),  forbidden
[13:15:11  INFO]     Insert size: observed 431.4 ± 100.9, fitted 431.4 ± 100.9
[13:15:11  INFO]     Allowed insert size: [171, 840]  (99.9%-confidence interval)
[13:15:11 DEBUG]     Using 100 bp windows, (window neighbourhood size 300, boundary 1000)
[13:15:11 DEBUG]     Total windows:     44980
[13:15:11 DEBUG]     Remove 0 windows with Ns, 17803 windows with common k-mers
[13:15:11 DEBUG]     After filtering:   27177
[13:15:11  INFO] Estimating read error profiles from 117142 reads
[13:15:11  INFO]         11037628 matches    (0.997525)
[13:15:11  INFO]            24367 mismatches (0.002202)
[13:15:11  INFO]             1537 insertions (0.000139)
[13:15:11  INFO]             1483 deletions  (0.000134)
[13:15:11  INFO]     Maximum allowed edit distance: 3 (for read length 150, 99%-confidence interval)
[13:15:11 DEBUG]     Few windows (< 100) with GC-content < 28 or > 77, bluring distributions
[13:15:11  INFO]     Read depth mean = 24.40,  variance: 28.86  (at GC-content 40)
[13:15:11  INFO] Success. Total time: 0:21:49.666

projectoriented commented 10 months ago

Okay, apologies for the spam but to be clear- locityper is doing good and the problem with preproc stems from our cluster. If @plendere has no additional hurdles, then this issue can be closed. Thanks a bunch!

tprodanov / locityper

Problems with Cigar Strings in adding loci to db #1