shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

Merge error (number of fields < query index field) #37

Closed (ericvdtoorn closed 9 months ago)

ericvdtoorn commented 10 months ago

When merging the results of a number of kmcp search runs, I get the following output:

> kmcp merge SRR6468718.kmcp_humgut@*.tsv.gz --out-file SRR6468718.kmcp_humgut.tsv.gz
11:28:13.212 [INFO] checking input files ...
11:28:13.213 [INFO]   9 input files given
11:28:13.213 [INFO] merging ...
11:44:06.311 [ERRO] number of fields (13) < query index field (15)

It does produce the output file; its first lines are:

❯ zcat SRR6468718.kmcp_humgut.tsv.gz | head
#query  qLen    qKmers  FPR hits    target  chunkIdx    chunks  tLen    kSize   mKmers  qCov    tCov    jacc    queryIdx
SRR6468718.1    101 81  4.5272e-15  1   GCF_000228045.1_SMUT1-NEX_25-98_genomic 9   10  2182376 21  81  1.0000  0.0004  0.0004  0
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME154070    5   10  2006006 21  81  1.0000  0.0004  0.0004  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME053624    2   10  1545775 21  81  1.0000  0.0005  0.0005  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME124299    2   10  1797471 21  81  1.0000  0.0005  0.0005  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME216177    0   10  1912264 21  81  1.0000  0.0004  0.0004  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME053394    2   10  1925801 21  81  1.0000  0.0004  0.0004  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME124618    3   10  1946290 21  81  1.0000  0.0004  0.0004  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME273974    0   10  1955198 21  81  1.0000  0.0004  0.0004  1
SRR6468718.2    101 81  4.5272e-15  385 GUT_GENOME127971    7   10  1964574 21  81  1.0000  0.0004  0.0004  1

shenwei356 commented 10 months ago

Hmm, I'm not sure of the exact reason. According to the error message, some input files might be truncated.

# check corrupted files
for f in SRR6468718.kmcp_humgut@*.tsv.gz; do \
    gzip -t "$f"; \
done;
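
If a file passes or fails `gzip -t` and you also want to see where rows are cut off, a field count can help. The following is only a sketch: `short_lines` is a hypothetical helper, not part of kmcp, and the default of 15 fields is an assumption taken from the error message above ("query index field (15)").

```shell
# Hypothetical helper (not part of kmcp): for a gzip-compressed TSV, print
# "file:line_number:field_count" for every non-comment line that has fewer
# than the expected number of tab-separated fields.
# Usage: short_lines FILE.tsv.gz [MIN_FIELDS]
# MIN_FIELDS defaults to 15, an assumption based on the error message.
short_lines() {
    # gzip -dc usually still writes whatever it managed to decompress from
    # a truncated archive, so a cut-off last line can still reach awk.
    gzip -dc "$1" 2>/dev/null \
        | awk -F'\t' -v n="${2:-15}" -v f="$1" \
            '!/^#/ && NF < n { print f ":" NR ":" NF }'
}
```

For example, `for f in SRR6468718.kmcp_humgut@*.tsv.gz; do short_lines "$f"; done` would list any short rows per input file.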

ericvdtoorn commented 10 months ago

It's still running, but on rechecking the logs I see that the last search had not finished:

==> SRR6468718.kmcp_humgut@9.log <==
23:11:51.022 [INFO] -------------------- [main parameters] --------------------
23:11:51.022 [INFO]   minimum    query length: 30
23:11:51.022 [INFO]   minimum  matched k-mers: 10
23:11:51.022 [INFO]   minimum  query coverage: 0.550000
23:11:51.022 [INFO]   minimum target coverage: 0.000000
23:11:51.022 [INFO] -------------------- [main parameters] --------------------
23:11:51.022 [INFO]
23:11:51.022 [INFO] searching ...
23:11:51.028 [INFO] reading sequence file: SRR6468718_1.fastp.fq.gz
23:25:33.126 [INFO] reading sequence file: SRR6468718_2.fastp.fq.gz

Also adding the output of running:

# check corrupted files
> for f in SRR6468718.kmcp_humgut@*.tsv.gz; do \
    gzip -t $f; \
done;
gzip: SRR6468718.kmcp_humgut@9.tsv.gz: unexpected end of file

shenwei356 commented 10 months ago

So that's the reason. You need to wait until all searches have finished before merging :)
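
One way to guard against this is to run the integrity check first and merge only if every input passes. This is a minimal sketch, not a kmcp feature: `all_gz_intact` is a hypothetical helper name, and the merge command is the one from the original report.

```shell
# Hypothetical helper (not part of kmcp): succeed only if every given file
# passes `gzip -t`; report each failing file on stderr.
all_gz_intact() {
    _status=0
    for _f in "$@"; do
        if ! gzip -t "$_f" 2>/dev/null; then
            echo "incomplete or corrupted: $_f" >&2
            _status=1
        fi
    done
    return "$_status"
}

# Usage, with the file names from the report above (left commented out so
# that sourcing this block has no side effects):
# all_gz_intact SRR6468718.kmcp_humgut@*.tsv.gz \
#     && kmcp merge SRR6468718.kmcp_humgut@*.tsv.gz --out-file SRR6468718.kmcp_humgut.tsv.gz
```

Note that this only detects truncated or corrupted archives. It is still safest to confirm that all search jobs have exited before merging.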