rmenegaux / fastDNA

[std::invalid_argument] Cannot be opened for training #1

Open ptynecki opened 5 years ago

ptynecki commented 5 years ago

Hey,

I wanted to use fastDNA, but I have run into an issue that blocks me.

I prepared a merged_training.fasta file which contains many FASTA samples from NCBI. I also created a labels.txt file with the labels (one per line).

Input:

./fastdna supervised -input merged_training.fasta -labels labels.txt -output model

Output:

libc++abi.dylib: terminating with uncaught exception of type std::invalid_argument: labels.txt cannot be opened for training!
Abort trap: 6

I'm sure that both files are correct and have the right permissions.

rmenegaux commented 5 years ago

Hi @ptynecki, Does the example script test/test.sh work ok for you?

ptynecki commented 5 years ago

@rmenegaux

Correct. The output from the test is below:

Training model fdna_k10_d10_e1
Read sequence n1, ENA|CP000473|CP000473.1|kraken:taxid|234267 ENA|CP000473|CP000473.1 Solibacter usitatus Ellin6076, complete geno
Read sequence n2, ENA|CP001472|CP001472.1|kraken:taxid|240015 ENA|CP001472|CP001472.1 Acidobacterium capsulatum ATCC 51196, comple
Read sequence n3, ENA|CP002542|CP002542.1|kraken:taxid|755732 ENA|CP002542|CP002542.1 Fluviicola taffensis DSM 16823, complete gen
Read sequence n4, ENA|CP004080|CP004080.1|kraken:taxid|1193806 ENA|CP004080|CP004080.1 Dehalococcoides mccartyi BTF08, complete ge
Read sequence n5, ENA|CP003947|CP003947.1|kraken:taxid|755178 ENA|CP003947|CP003947.1 Cyanobacterium aponinum PCC 10605, complete
Read sequence n6, ENA|CP003856|CP003856.1|kraken:taxid|1100841 ENA|CP003856|CP003856.1 Acinetobacter baumannii TYTH-1, complete ge
Read sequence n7, ENA|CP004009|CP004009.1|kraken:taxid|1274814 ENA|CP004009|CP004009.1 Escherichia coli APEC O78, complete genome.
Read sequence n8, ENA|AE017042|AE017042.1|kraken:taxid|229193 ENA|AE017042|AE017042.1 Yersinia pestis biovar Microtus str. 91001,
Read sequence n9, ENA|CP001622|CP001622.1|kraken:taxid|395491 ENA|CP001622|CP001622.1 Rhizobium leguminosarum bv. trifolii WSM1325
Read sequence n10, ENA|CP000975|CP000975.1|kraken:taxid|481448 ENA|CP000975|CP000975.1 Methylacidiphilum infernorum V4, complete genome.
Number of sequences 10
Number of labels: 10
Number of words: 1048576
Progress: 100.0% fragments/sec/thread:   16110 lr:  0.000000 loss:  2.027800 ETA:   0h 0m
Testing model fdna_k10_d10_e1
N   10000
P@1 0.197
R@1 0.197
Number of examples: 10000

rmenegaux commented 5 years ago

Hmm... this error is only thrown when the labels file cannot be opened, so I don't know why you are getting it if the file is correct (right path, right permissions). Perhaps you could send it to me.

It should be unrelated, but I get a different error when running your command, because the k-mer length defaults to 0 (that is a bug I need to fix). With the command line interface you should pass a -minn argument:

./fastdna supervised -input merged_training.fasta -labels labels.txt -output model -minn 4
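
To narrow down the "cannot be opened" error independently of fastDNA, a small pre-flight check along the following lines might help. This is only a sketch and not part of fastDNA: the function name `check_inputs` is hypothetical, and the comparison of FASTA records against label lines assumes the labels file holds one label per sequence, one per line, as described in the original report. It checks that both paths resolve to readable files from the current working directory, then counts records and labels.

```python
import os


def check_inputs(fasta_path, labels_path):
    """Return a list of human-readable problems; the list is empty if none are found.

    Hypothetical helper, not part of fastDNA.
    """
    problems = []
    for path in (fasta_path, labels_path):
        if not os.path.isfile(path):
            # Relative paths are resolved against the current working directory.
            problems.append(f"{path}: not a regular file (cwd is {os.getcwd()})")
        elif not os.access(path, os.R_OK):
            problems.append(f"{path}: not readable by the current user")
    if problems:
        return problems

    # Each FASTA record starts with a '>' header line.
    with open(fasta_path) as fasta:
        n_sequences = sum(1 for line in fasta if line.startswith(">"))
    # Assumption: one label per non-empty line, one line per sequence.
    with open(labels_path) as labels:
        n_labels = sum(1 for line in labels if line.strip())

    if n_sequences != n_labels:
        problems.append(
            f"{n_sequences} FASTA records but {n_labels} non-empty label lines"
        )
    return problems
```

Calling `check_inputs("merged_training.fasta", "labels.txt")` from the same directory in which `./fastdna` is run should return an empty list if both files can be opened and the counts agree. A non-empty result points to a path, permission, or count mismatch rather than to fastDNA itself.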