ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

recommended tests and test data #191

Closed mr-c closed 4 years ago

mr-c commented 4 years ago

Hello!

wtdbg2 is packaged for Debian https://tracker.debian.org/pkg/wtdbg2

We'd love to run some tests, can you include some in the repo or point us to freely licenced test data?

Thanks!

ruanjue commented 4 years ago

Please have a look at these datasets:

ONT:
    Escherichia coli, 1.2G
        SRR11475550 (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11475550)
    Caenorhabditis elegans, 3.3G
        SRR11456709 (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR11456709)
PacBio:
    Escherichia coli, PacBio RS II, 1.3G
        SRR8494908 (https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8494908)

mr-c commented 4 years ago

Thank you @ruanjue for the response. I'm happy to personally validate using data of that size, but for Debian we can't use such large data. Is there a smaller dataset that you recommend?

For both the original recommendation and the smaller dataset, how exactly should we invoke the tools and is there a specific expected result?

mr-c commented 4 years ago

This resource may be useful to you: https://bssw.io/items/what-is-cse-software-testing

ruanjue commented 4 years ago

I am afraid those are the smallest datasets I know of. Otherwise, we would have to simulate test data.

mr-c commented 4 years ago

Okay. How should we use this data to test the wtdbg2 tools?

ruanjue commented 4 years ago

Let's take SRR8494908 as an example:

# download the raw data
prefetch SRR8494908
# convert to gzipped FASTA
fastq-dump --fasta --gzip SRR8494908.sra
# assemble
wtdbg2.pl -o ecoli -t 8 -x rs -g 4.5m SRR8494908.fasta.gz
# the final contigs
ls ecoli.cns.fa

ruanjue commented 4 years ago

prefetch and fastq-dump are part of sra-tools: https://github.com/ncbi/sra-tools

mr-c commented 4 years ago

> #the final contigs
> ls ecoli.cns.fa

Should ecoli.cns.fa have a specific size or MD5 checksum?

ruanjue commented 4 years ago

The content of the result file may vary.

mr-c commented 4 years ago

How could we test the contents of the file to determine if wtdbg2.pl is functioning correctly?

ruanjue commented 4 years ago

By aligning the contigs against the reference E. coli genome. I think it is complex to test correctness automatically; the best way is to skip that check.
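
For readers who want to automate the alignment idea anyway, here is a minimal sketch, not something proposed in the thread. It assumes the contigs have already been aligned to a reference with an aligner that writes PAF (for example minimap2, which is an assumption here), and it reports what fraction of the reference bases are covered by at least one alignment. The function name `paf_covered_fraction` and the file names in the comments are hypothetical.

```shell
#!/bin/sh
# paf_covered_fraction PAF_FILE
# Prints the fraction of reference bases covered by at least one alignment.
# Uses only the standard PAF columns: 6 = target name, 7 = target length,
# 8 = target start (0-based), 9 = target end (exclusive).
# Only reference sequences that appear in the PAF file are counted.
paf_covered_fraction() {
    tab=$(printf '\t')
    # Group records by target name, then order them by target start.
    LC_ALL=C sort -t "$tab" -k6,6 -k8,8n "$1" | awk -F '\t' '
        $6 != name {            # first record of a new reference sequence
            name = $6; total += $7; end = 0
        }
        {
            ts = $8 + 0; te = $9 + 0
            if (ts < end) ts = end        # clip the part already counted
            if (te > ts) { covered += te - ts; end = te }
        }
        END {
            if (total > 0) printf "%.4f\n", covered / total
            else print "0.0000"
        }
    '
}

# Possible use, assuming minimap2 is installed and ref.fa is a reference genome:
#   minimap2 -x asm5 ref.fa ecoli.cns.fa > aln.paf
#   paf_covered_fraction aln.paf
```

What threshold counts as a pass is left open; the thread gives no expected value for it.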

mr-c commented 4 years ago

Hmm.. The entire point is to have a test that we can run automatically. I don't think a reference genome alignment is unreasonable.

If you had unit tests for parts of functionality, that would be an acceptable alternative.

ruanjue commented 4 years ago

An easy way is to check the file size: ecoli.cns.fa is about 4.6 MB, so let's set a range for it, 4.4 to 4.8 MB.
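
The size check suggested above could be sketched as a small shell function for an automated test. This is only an illustration: the function name `check_size_in_range` is hypothetical, and the byte limits in the example call assume that 1 MB means 1,000,000 bytes, which the thread does not specify.

```shell
#!/bin/sh
# check_size_in_range FILE MIN_BYTES MAX_BYTES
# Returns 0 only if FILE exists and its size in bytes lies within
# [MIN_BYTES, MAX_BYTES]; otherwise prints a message to stderr and returns 1.
check_size_in_range() {
    file=$1
    min=$2
    max=$3
    if [ ! -f "$file" ]; then
        echo "missing file: $file" >&2
        return 1
    fi
    # Reading from stdin makes wc print only the byte count; the arithmetic
    # expansion drops any padding that some wc implementations add.
    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min" ] || [ "$size" -gt "$max" ]; then
        echo "unexpected size for $file: $size bytes" >&2
        return 1
    fi
    echo "ok: $file is $size bytes"
}

# Example call after the assembly has finished
# (assumes 4.4 to 4.8 MB means 4400000 to 4800000 bytes):
#   check_size_in_range ecoli.cns.fa 4400000 4800000
```

A size range only catches gross failures such as an empty or truncated assembly; it says nothing about whether the contigs are correct.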