hzi-bifo / Haploflow

GNU General Public License v3.0

Empty files as output #11

Open mokrobial opened 2 years ago

mokrobial commented 2 years ago

Apologies if I have missed a setup step. I installed Haploflow successfully with conda and was able to run the test data .fq without issue. But when I run it on my own FASTQ file, the output is empty. I've tried several different files and get the same result each time: 0 vertices. Do reads require some kind of pre-processing first?

Log:
Building deBruijnGraph...
Building deBruijnGraph took 0.485469 seconds.
deBruijnGraph has 0 vertices
Building unitig graph from deBruijn graph...
Getting connected components
Getting CCs took 1.4e-05 seconds
Calculating coverage distribution
Calculating coverage distribution took 3.4e-05 seconds
Unitig graph successfully build in 0.000138 seconds.
Unitig graph has 0 vertices
Assembling...
Cleaning graph
Assembly complete
Assembly took 0.00033 seconds
The complete assembly process took 0.485904 seconds.

AlphaSquad commented 2 years ago

The test data works fine, but your own data does not? Odd. Could you provide your read file or a snippet of it? Also, if you set k, what value did you provide, and how long are your reads?

mokrobial commented 2 years ago

I didn't set --k initially. I just tried with it set to 39 and still get empty folders. Read length is 2x150.

Github won't let me include a zip file so I've put one here: https://drive.google.com/drive/folders/1SFkD2dDKU1GLpdtEcPoqvLcvGoEYY-fY?usp=sharing

Thanks much!

AlphaSquad commented 2 years ago

Hi, sorry that it took so long. I have tested the read files you provided and found that, for most of the files, all contigs were shorter than 500 bp. Haploflow does not report contigs shorter than 500 bp by default, so no contigs were reported. This can happen for two reasons. Either there are too many strains in the sample, in which case Haploflow cannot distinguish them by their coverage and avoids misassemblies by breaking contigs apart; or there is no clear signal in the data, because no genome is covered more than, let's say, 4x, or because there are too many errors. Haploflow reports all contigs if the filter option is set to 0, but that probably does not make much sense here.
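
To illustrate why a length cutoff alone can produce empty output, here is a minimal Python sketch. It is not Haploflow's code: `filter_contigs` and the example contigs are hypothetical, and the only thing taken from the comment above is the default cutoff of 500 bp.

```python
def filter_contigs(contigs, min_length=500):
    """Keep only contigs whose sequence is at least min_length bases long."""
    return {name: seq for name, seq in contigs.items() if len(seq) >= min_length}


# Hypothetical assembly in which every contig is shorter than 500 bp.
contigs = {
    "contig_1": "ACGT" * 100,  # 400 bp
    "contig_2": "ACGT" * 60,   # 240 bp
}

print(len(filter_contigs(contigs)))                # 0: nothing passes the 500 bp cutoff
print(len(filter_contigs(contigs, min_length=0)))  # 2: a cutoff of 0 keeps everything
```

If every contig is fragmented below the cutoff, the result is empty even though the assembly itself ran, which matches the behaviour described above.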

Ruchank1 commented 2 years ago

Hi, I am getting the same issue (empty folders, 0 vertices) with the test data file as well. Can you please help me with that? Thank you.

AlphaSquad commented 2 years ago

Could you post the command you used and the output you received?

Ruchank1 commented 2 years ago

Sure.
The command: haploflow --read-file .../forward.fastq --out test --log test/log
The output: empty subfolders in a folder named test.
The log file looked like this:
Building deBruijnGraph...
Building deBruijnGraph took 0.00039 seconds.
deBruijnGraph has 0 vertices
Building unitig graph from deBruijn graph...
Getting connected components
Getting CCs took 2.7e-05 seconds
Calculating coverage distribution
Calculating coverage distribution took 6.1e-05 seconds
Unitig graph successfully build in 0.000286 seconds.
Unitig graph has 0 vertices
Assembling...
Cleaning graph
Assembly complete
Assembly took 0.000669 seconds
The complete assembly process took 0.001141 seconds.

The number of vertices is 0.

AlphaSquad commented 2 years ago

Haploflow should pick a meaningful default value for k, but it seems this is not working right now. Please try running Haploflow again with k set explicitly, e.g. --k 41
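
For context on how k and read length interact, here is a minimal sketch of textbook k-mer extraction in Python. It is a generic illustration, not Haploflow's graph construction, and both function names are hypothetical. In such a construction, an empty graph means no k-mers were collected, for example because no reads were read or every read was shorter than k. Whether Haploflow applies additional filtering is not stated in this thread.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers that a single read of read_length bases contributes."""
    return max(0, read_length - k + 1)


def distinct_kmers(reads, k):
    """Collect the distinct k-mers found across all reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


print(kmers_per_read(150, 41))  # 110: a 150 bp read is long enough for k = 41
print(kmers_per_read(30, 41))   # 0: reads shorter than k contribute nothing
print(len(distinct_kmers([], 41)))  # 0: no reads, no k-mers
```

With 150 bp reads and k = 41, each read would contribute k-mers, so under these assumptions a count of 0 points to the input rather than to the choice of k.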

Ruchank1 commented 2 years ago

I tried running the command with the k value set, but it still shows 0 vertices.

AlphaSquad commented 2 years ago

Could you post your forward.fastq? The toy data set is named HIV_3_toy.fq, which is why I am asking.

Ruchank1 commented 2 years ago

Hi, I actually tried with the HIV_3_toy.fq dataset as well and got the same output. So I can't really figure out what is happening.

AlphaSquad commented 2 years ago

It is odd. The only explanation I have is that Haploflow tries to read a non-existing file. Could you maybe try absolute paths for all files?
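
One way to rule out a missing or unreadable input before running the assembler is a small pre-flight check. The following Python sketch is independent of Haploflow; `check_read_file` is a hypothetical helper, and the checks (existence, non-zero size, gzip magic bytes, leading `@`) are assumptions about what a plain FASTQ file should look like.

```python
from pathlib import Path


def check_read_file(path_str):
    """Return (absolute_path, problem); problem is None if the file looks usable."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        return path, "file does not exist"
    if path.stat().st_size == 0:
        return path, "file is empty"
    with path.open("rb") as handle:
        first = handle.read(2)
    if first == b"\x1f\x8b":  # gzip magic bytes
        return path, "file is gzip-compressed"
    if not first.startswith(b"@"):  # FASTQ headers start with '@'
        return path, "first record does not start with '@'"
    return path, None


# Prints the resolved absolute path and the first problem found, if any.
print(check_read_file("forward.fastq"))
```

The absolute path returned by the check can then be passed to --read-file.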

Ruchank1 commented 2 years ago

Yes, I tried giving absolute paths as well. I am still getting empty files as output. I installed Haploflow using conda; is it possible that I missed a step?

Ruchank1 commented 2 years ago

Hi, I tried it on a linux machine as well but it still gives 0 vertices as output. I cannot really locate the problem.

AlphaSquad commented 2 years ago

Hm, okay. Haploflow was only tested on UNIX systems, but it is strange that it is not working on a Linux machine either. Unfortunately, I am not really sure what to do here, since I cannot reproduce this problem. I will, however, add a check for missing files, but it may take a moment until this change is done and available on conda (and if no file is missing, this does not solve your problem either).

reesea22 commented 1 year ago

I have been getting empty files as output for my data as well. When I attempt to run the toy dataset through haploflow I get the following error:
$ haploflow --read-file Haploflow/HIV_3_toy.fq.gz --out test --log test/log
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 0) >= this->size() (which is 0)
Aborted (core dumped)

AlphaSquad commented 1 year ago

Are you also using the conda version/install? If yes, can you try to unzip the read file first?
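
Unzipping can be done with `gunzip` on the command line or, as in this minimal Python sketch, with the standard library. `gunzip_copy` is a hypothetical helper name, and the output file name in the usage comment is only an example.

```python
import gzip
import shutil


def gunzip_copy(src, dst):
    """Write a decompressed copy of the gzip file src to dst and return dst."""
    with gzip.open(src, "rb") as compressed, open(dst, "wb") as plain:
        shutil.copyfileobj(compressed, plain)
    return dst


# Example usage (paths are illustrative):
# gunzip_copy("Haploflow/HIV_3_toy.fq.gz", "HIV_3_toy.fq")
```

The decompressed file can then be passed to --read-file in place of the .gz file.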

adelizamae commented 1 year ago

Hi, I also don't have output files. I'm not sure what I'm doing wrong. :(

I ran: haploflow --read-file sample.fastq --k 41 --out test/ --log test/log/

But there is no output file except the Cov.tsv.

AlphaSquad commented 1 year ago

Hi, I am sorry that Haploflow is not working out of the box for you. Unfortunately I will need a little bit more information to give you any feedback (since the command looks ok): Are you using the conda version or did you build Haploflow yourself? What do the log/Cov.tsv files say? How big is your sample.fastq and how long are the reads?
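
The last question (number and length of reads) can be answered with a short script. This is a minimal Python sketch, assuming an uncompressed FASTQ file with exactly four lines per record; `fastq_stats` is a hypothetical helper, not part of Haploflow.

```python
def fastq_stats(path):
    """Return (read_count, min_len, mean_len, max_len) for an uncompressed FASTQ."""
    lengths = []
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            # The second line of each 4-line record holds the sequence.
            if line_number % 4 == 1:
                lengths.append(len(line.strip()))
    if not lengths:
        return 0, 0, 0.0, 0
    return len(lengths), min(lengths), sum(lengths) / len(lengths), max(lengths)


# Example usage (file name taken from the command above):
# print(fastq_stats("sample.fastq"))
```

A read count of 0, or reads much shorter than k, would be worth reporting along with the log.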

adelizamae commented 1 year ago

Hi, I used both the conda version and my own build. It turns out there are no contigs longer than 500 bp, which is why there is no output in my case.

I have SARS-CoV-2 long read sequences (produced by using ONT) and I would like to know what parameters I can use to do de novo assembly.

I uploaded my sample fastq to this Google Drive folder: https://drive.google.com/drive/folders/1__4TscNV_LJyRbgzjGN-s3ehcB52S5zI?usp=sharing I'm just starting to learn bioinformatics, so your help is greatly appreciated!