pangenome / pggb

the pangenome graph builder
https://doi.org/10.1038/s41592-024-02430-3
MIT License

seqwish std::invalid_argument error #199

Closed Boer223 closed 2 years ago

Boer223 commented 2 years ago

Hi,

When I use the following command to build a pangenome graph from 19 genomes, an error occurs. Command:

pggb -i 19-genomes.merge.fa -n 19 -o output -p 90 -s 100000 -t 5 -T 5 -M -Z -a wfmash-3TaQ4Q

Error:

[seqwish] WARNING: input alignment file wfmash-3TaQ4Q does not have CIGAR strings. The resulting graph will only represent the input sequences.
[seqwish::seqidx] 0.001 indexing sequences
[seqwish::seqidx] 162.481 index built
[seqwish::alignments] 162.481 processing alignments
terminate called after throwing an instance of 'std::invalid_argument'
  what():  stoi
seqwish -t 5 -s 19-genomes.merge.fa -p wfmash-3TaQ4Q -k 47 -g output/wfmash-3TaQ4Q.e34d4cd.seqwish.gfa -B 10000000 -f 0 -P
103.83s user 38.57s system 87% cpu 162.72s total 2366304Kb max memory

AndreaGuarracino commented 2 years ago

You can't use the wfmash-xxxx file as -a/--input-paf: it is a temporary wfmash file that contains only the mappings (that is, the regions to align), so it has no CIGAR strings. seqwish warns you about this ([seqwish] WARNING: input alignment file wfmash-3TaQ4Q does not have CIGAR strings). Moreover, the file seems to contain invalid information, which is what triggers the error. Try running pggb with the output of wfmash instead (in your case, it should be called output/wfmash-3TaQ4Q.paf).
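
As a minimal sketch of the check described above, the following shell function reports whether a PAF file contains any CIGAR strings. The function name `paf_has_cigar` is hypothetical (it is not part of pggb, wfmash, or seqwish), and the sketch assumes CIGARs are stored in an optional `cg:Z:` tag after the 12 mandatory tab-separated PAF columns.

```shell
# Hypothetical helper (not part of pggb/wfmash/seqwish).
# Assumption: CIGAR strings, when present, are stored in an optional
# cg:Z: tag that follows the 12 mandatory tab-separated PAF columns.
paf_has_cigar() {
    awk -F'\t' '
        # Scan only the optional tag columns (13 onwards) of each record.
        { for (i = 13; i <= NF; i++) if ($i ~ /^cg:Z:/) { found = 1; exit } }
        # END still runs after exit, so the verdict is always printed.
        END { print (found ? "yes" : "no") }
    ' "$1"
}
```

Running `paf_has_cigar` on a mappings-only file, such as the temporary file discussed here, would be expected to print `no`, since its records carry only tags like `id:f:`.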

Boer223 commented 2 years ago

@AndreaGuarracino Thank you for your quick reply! But when I use pggb -i 19-genomes.merge.fa -n 19 -o output -p 90 -s 100000 -t 5 -T 5 -M -Z to create the pangenome graph, it does not generate the PAF file. There is only a wfmash-3TaQ4Q temp file.

AndreaGuarracino commented 2 years ago

Weird, or maybe you haven't waited long enough. What does the log say about the estimated mapping and alignment time? I suggest reducing -s to 50000 and waiting a bit longer. If the problem persists, please share the output/...log file.

Boer223 commented 2 years ago

The log ends with the following:

[E::fai_load3_core] Failed to open FASTA file 19-genomes.merge.fa
wfmash -X -s 100000 -p 90 -n 18 -t 16 19-genomes.merge.fa 19-genomes.merge.fa
15440.41s user 792.33s system 1172% cpu 1384.73s total 7245936Kb max memory

Boer223 commented 2 years ago

log.txt

AndreaGuarracino commented 2 years ago

[E::fai_load3_core] Failed to open FASTA file 19-genomes.merge.fa

It cannot see the input FASTA file, which is very strange. Can I see your 19-genomes.merge.fa.fai file too? And also the output of head /home/cuixb/data/analysis_data/graph-pan-genome/pggb-result/wfmash-3TaQ4Q?
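
A quick way to rule out basic file-access problems behind a "Failed to open FASTA file" message is sketched below. The function name `check_fasta_access` is hypothetical, and the sketch only tests existence, read permission, and non-zero size from the current shell. It does not reproduce how wfmash itself opens the file.

```shell
# Hypothetical helper: basic access checks on a FASTA path.
# Assumption: run from the same working directory and as the same user
# that pggb would run under, so relative paths and permissions match.
check_fasta_access() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing"        # path does not exist from this directory
    elif [ ! -r "$f" ]; then
        echo "unreadable"     # exists, but the current user cannot read it
    elif [ ! -s "$f" ]; then
        echo "empty"          # exists and is readable, but has zero bytes
    else
        echo "ok"
    fi
}
```

For example, `check_fasta_access 19-genomes.merge.fa` prints `ok` only if the file is visible, readable, and non-empty from where the command is run.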

Boer223 commented 2 years ago

19-genomes.merge.fa.fai file: 19-genomes.merge.fa.zip

head of wfmash-3TaQ4Q file:

Darmor_v10#1#A01    32958928    27800000    28300000    +   Darmor_v10#1#C01    48239358    47247687    47879060    5741    631373  10  id:f:90.9308
Darmor_v10#1#A01    32958928    0   3800000 +   Darmor_v5#1#chrC01  38829317    850 4733913 44055   4733063 12  id:f:93.0793
Darmor_v10#1#A01    32958928    27000000    29700000    +   Darmor_v5#1#chrC01  38829317    35738139    38333401    25321   2700000 12  id:f:93.7809
Darmor_v10#1#A01    32958928    30500000    31200000    +   Darmor_v5#1#chrC01  38829317    38267435    38823342    6814    700000  16  id:f:97.3442
Darmor_v10#1#A01    32958928    29900000    30500000    -   Darmor_v5#1#chrAnn_random   48658326    1918964 2515790 5847    600000  16  id:f:97.4553
Darmor_v10#1#A01    32958928    15700000    16300000    -   Darmor_v5#1#chrAnn_random   48658326    3155785 3717399 5876    600000  17  id:f:97.9259
Darmor_v10#1#A01    32958928    27800000    28300000    +   Express617#1#chrC01 44118044    38888171    39510831    5664    622660  10  id:f:90.972
Darmor_v10#1#A01    32958928    28700000    29900000    +   Express617#1#chrC01 44118044    40944888    42168781    11515   1223893 12  id:f:94.0823
Darmor_v10#1#A01    32958928    27800000    28300000    +   FAFU_ZS11#1#chrC01  54641295    49487432    50101595    5581    614163  10  id:f:90.8653
Darmor_v10#1#A01    32958928    31200000    31900000    +   FAFU_ZS11#1#chrC01  54641295    50286548    50945152    6412    700000  11  id:f:91.6069

the whole wfmash-3TaQ4Q file: wfmash-3TaQ4Q.zip

AndreaGuarracino commented 2 years ago

The FASTA index seems healthy. The input contains a lot of sequences, but I don't think (hope) that's the problem. Can you try it with other, smaller FASTA files, both in the same folder as your current input and in other folders? I am wondering if there is an issue that is specific to your system. In each test, please also delete and regenerate the FASTA index, to be safe.

ekg commented 2 years ago

The number of sequences should not have an effect here. The system has been tested with millions of input sequences and there should not be any limit.

It seems that you can't read the FASTA.

Please confirm that these return the same value:

cat ref.fa | grep '^>' | wc -l

wc -l ref.fa.fai
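
The two commands above can be wrapped into a single comparison, sketched here. The function name `fai_matches_fasta` is hypothetical, and the sketch assumes the index sits next to the FASTA as `<fasta>.fai` with one line per sequence.

```shell
# Hypothetical helper: compare the number of FASTA headers with the
# number of lines in the index.
# Assumption: the index is located at "<fasta>.fai", one line per sequence.
fai_matches_fasta() {
    fasta=$1
    # Count header lines; n + 0 forces a numeric 0 when there are none.
    n_fa=$(awk '/^>/ { n++ } END { print n + 0 }' "$fasta")
    # Count index records.
    n_fai=$(awk 'END { print NR }' "$fasta.fai")
    if [ "$n_fa" -eq "$n_fai" ]; then
        echo "match ($n_fa sequences)"
    else
        echo "mismatch (FASTA: $n_fa, index: $n_fai)"
    fi
}
```

Called as `fai_matches_fasta ref.fa`, it prints a `match` line when the counts agree and a `mismatch` line with both counts otherwise.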

Boer223 commented 2 years ago

@ekg As you suggested, I checked the number of sequences in the reference genome, and both commands return the same value.

image

Boer223 commented 2 years ago

After I reinstalled the whole pggb environment using conda, it ran successfully without errors.