hzi-bifo / Haploflow

GNU General Public License v3.0

Installation error #10

Closed arunvv90 closed 2 years ago

arunvv90 commented 2 years ago

Hi, I was trying to install Haploflow. In the last step, when I ran make to build the executable, I got the following error:

make[1]: Entering directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
make[2]: Entering directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
Scanning dependencies of target haploflow
make[2]: Leaving directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
make[2]: Entering directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
[ 16%] Building CXX object CMakeFiles/haploflow.dir/main.cpp.o
[ 33%] Building CXX object CMakeFiles/haploflow.dir/deBruijnGraph.cpp.o
[ 50%] Building CXX object CMakeFiles/haploflow.dir/Sequence.cpp.o
[ 66%] Building CXX object CMakeFiles/haploflow.dir/Vertex.cpp.o
[ 83%] Building CXX object CMakeFiles/haploflow.dir/UnitigGraph.cpp.o
/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp: In member function ‘std::vector UnitigGraph::calculate_thresholds(deBruijnGraph&, std::__cxx11::string, unsigned int)’:
/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp:126:20: warning: unused variable ‘cov’ [-Wunused-variable]
         for (auto& cov : cov_distr)
                    ^
/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp: In member function ‘std::vector UnitigGraph::get_thresholds(std::vector<std::map<unsigned int, unsigned int> >&, std::__cxx11::string, unsigned int)’:
/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp:227:33: error: ‘isnan’ was not declared in this scope
                     if (!isnan(v) and window[pos - 1] < v)
                                 ^
/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp:227:33: note: suggested alternatives:
In file included from /usr/include/c++/5/random:38:0,
                 from /usr/include/c++/5/bits/stl_algo.h:66,
                 from /usr/include/c++/5/algorithm:62,
                 from /home/arun/anaconda3/envs/npsm/include/boost/smart_ptr/shared_ptr.hpp:39,
                 from /home/arun/anaconda3/envs/npsm/include/boost/shared_ptr.hpp:17,
                 from /home/arun/anaconda3/envs/npsm/include/boost/property_map/vector_property_map.hpp:14,
                 from /home/arun/anaconda3/envs/npsm/include/boost/property_map/property_map.hpp:602,
                 from /home/arun/anaconda3/envs/npsm/include/boost/graph/graphviz.hpp:19,
                 from /home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.h:4,
                 from /home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp:1:
/usr/include/c++/5/cmath:648:5: note: ‘std::isnan’
     isnan(_Tp __x)
     ^
In file included from /home/arun/anaconda3/envs/npsm/include/boost/lexical_cast/detail/inf_nan.hpp:35:0,
                 from /home/arun/anaconda3/envs/npsm/include/boost/lexical_cast/detail/converter_lexical_streams.hpp:63,
                 from /home/arun/anaconda3/envs/npsm/include/boost/lexical_cast/detail/converter_lexical.hpp:54,
                 from /home/arun/anaconda3/envs/npsm/include/boost/lexical_cast/try_lexical_convert.hpp:44,
                 from /home/arun/anaconda3/envs/npsm/include/boost/lexical_cast.hpp:32,
                 from /home/arun/anaconda3/envs/npsm/include/boost/property_map/dynamic_property_map.hpp:23,
                 from /home/arun/anaconda3/envs/npsm/include/boost/graph/graphviz.hpp:25,
                 from /home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.h:4,
                 from /home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/UnitigGraph.cpp:1:
/home/arun/anaconda3/envs/npsm/include/boost/math/special_functions/fpclassify.hpp:606:14: note: ‘boost::math::isnan’
 inline bool (isnan)(T x)
             ^
make[2]: *** [CMakeFiles/haploflow.dir/build.make:115: CMakeFiles/haploflow.dir/UnitigGraph.cpp.o] Error 1
make[2]: Leaving directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
make[1]: *** [CMakeFiles/Makefile2:76: CMakeFiles/haploflow.dir/all] Error 2
make[1]: Leaving directory '/home/arun/catfish/snmk_py3.6/bioram_snmkfile/Haploflow/build'
make: *** [Makefile:84: all] Error 2

Looking forward to the help. Thank you

AlphaSquad commented 2 years ago

Interesting, that error never occurred with my compiler, but there was a namespace missing. Can you please test again whether Haploflow compiles now?

arunvv90 commented 2 years ago

Thank you. It solved the problem. I was in the middle of installing old boost and GCC libraries. Your immediate reply saved my day!!!!

AlphaSquad commented 2 years ago

Glad I could help! I also added a note to the README that Haploflow can now be installed via bioconda, which saves you the trouble of building it manually (it might not be the most recent version, though). That information would probably have been useful beforehand; I am sorry for that.

arunvv90 commented 2 years ago

Really!!! That is the best news. I could not install it on my server because I don't have root access. This solves my problem. It does not support nanopore reads, right? I have both Illumina and nanopore reads of a herpesvirus. I have tried all other assemblers and they collapse in the repeat region. I have been stuck on this for months and Haploflow is one of my last hopes. Do we need paired Illumina data for Haploflow? Earlier you mentioned that either forward or reverse reads are enough. I am a computational biology PhD guy. I am curious whether you have seen any difference in assembly quality between using only forward reads and combining forward and reverse reads. If I have to use both forward and reverse reads, what method do you suggest for combining them? Thank you

arunvv90 commented 2 years ago

Hi, I checked the conda installation. The only available version is 0.1. Can you add version 1 if possible? Is there a significant difference in the results if I use 0.1 and version 1?

AlphaSquad commented 2 years ago

There should not be any significant differences. Version 0.1 is the one used for the Haploflow publication; the changes since then were mostly "cosmetic", plus improved debugging information. I will look into how to update the conda version, though.

You definitely can use Nanopore reads as well as paired-end reads; you only need to provide a single FASTQ file. If you have paired-end reads, combining them with e.g. cat Forward.fq Reverse.fq > Reads.fq should be enough. Unfortunately, Haploflow is not able to use the additional information of long reads or paired-end reads, so long reads in particular might reduce the accuracy of Haploflow (because of their higher error rates). Paired-end reads might improve assembly quality because of the additional "increase" in sequencing depth.
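For files that are gzip-compressed (such as .fq.gz), the same concatenation can be sketched in Python. This is only an illustration: the helper names open_reads and combine_fastq are made up here, and writing an uncompressed output file is an assumption, since the thread does not say whether Haploflow reads compressed input.

```python
import gzip
import shutil
from pathlib import Path


def open_reads(path):
    """Open a FASTQ file for binary reading, handling .gz transparently."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rb")
    return open(path, "rb")


def combine_fastq(inputs, output):
    """Concatenate FASTQ files into one uncompressed file, in the order given.

    Like `cat`, this copies bytes as they are; it does not interleave
    read pairs or validate the FASTQ records.
    """
    with open(output, "wb") as out:
        for path in inputs:
            with open_reads(path) as src:
                shutil.copyfileobj(src, out)


# Hypothetical usage, mirroring the cat command above:
# combine_fastq(["Forward.fq", "Reverse.fq"], "Reads.fq")
```

As with cat, each input file should end with a newline so that the last record of one file does not run into the first record of the next.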

Hopefully Haploflow will be able to resolve the repeat region. Whether it works or not - in any case I would be happy to hear about it.

arunvv90 commented 2 years ago

Thanks. One of my areas of interest is to benchmark different assemblers with respect to virus genomes. In general, all the assemblers are benchmarked with the human and bacterial genomes. I have 20 genomes of herpesviruses sequenced with both Illumina and nanopore. I have simulated data. I am planning to extend this to other important viruses too. I am working on that data with different assemblers. It will be interesting to see the comparison of haploflow assembly produced using nanopore and Illumina data. If you are interested, please join me in the benchmarking project as a collaborator. Please send me an email to arunvv90@gmail.com. Anyway I will update the results

arunvv90 commented 2 years ago

Here are the results.

My virus is a fish herpes virus which is similar to HSV-1. The expected genome size is 135kb. The genome has a unique region and is flanked by two repeat regions on both sides. Both repeat regions are similar. I have tried to run haploflow and it produced large number of contigs with broken assembly. I am attaching the quast report and screenshots where assemblies are compared to the ATCC type strain.

Names of the assemblies as shown in the quast reports are as follows:
Il_com_contigs - Illumina reads combining both forward and reverse reads
il_com_k51_contigs - Illumina reads combining both forward and reverse reads and k has changed from default 41 to 51
il_com_k61_contigs - Illumina reads combining both forward and reverse reads and k has changed from default 41 to 61
Il_for_contigs - Illumina reads only with forward reads
il_for_k51_contigs - Illumina reads only with reverse reads
il_rev_k51contigs - Illumina reads only with reverse reads and k has changed from default 41 to 51
np_canu_1000l_80x_contigs - nanopore reads corrected with canu and filtered out the reads length below 1000 and 80X data is selected
Raw nanopore - 80X reads without any error correction produced 3 kb contig and removed from quast analysis.

As you can see from the quast report, different data sets solve different parts of the genome. Maybe I can merge these contigs to get a consensus assembly. Can you help me to get better assembly with larger contigs and more genome coverage?

My objective is to get strain-resolved genomes. If the sample has two strains, how can I get two separate genomes? I saw a script (https://github.com/hzi-bifo/Haploflow_supplementary) as supplementary information of the publication. I could see some python script in this depository too. I am a bit confused with the documentation. Can you help me with how to merge and create consensus genomes? If you need my raw data for testing, I would be happy to share it. I have both Illumina and nanopore data.
Any help is highly appreciated. Thank you

[Attached screenshots: q1, q3, q2]

report.pdf

/edit A.F.: fixed the links

AlphaSquad commented 2 years ago

Thank you for the data, interesting to see. Is this real or simulated data? Is the QUAST reference genome exactly the one in the dataset/read set? Do other assemblers resolve the middle part in a single contig? Are there (two, multiple) strains in your data set or are you only interested in the repeat region? What does the flow of the contigs in the repeat regions look like? E.g. the flow for Il_for_contigs is 600-700; if the flow of the contigs in the repeat region is ~1200-1400, then this is a sign that Haploflow is also collapsing the repeats. If the flow is lower (also around 600-700), then the contigs for the rest of the repeat region might be even shorter. Haploflow does this to avoid mismatches/misassemblies if the coverage of two "strains" is close to 50% - which is what happens in a repeat region. Otherwise Haploflow might produce chimeric contigs.
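The flow comparison described above can be written down as a small check. This is a rough sketch, not part of Haploflow: the function name repeat_flow_status, the tolerance of 0.3 and the three labels are assumptions chosen for illustration, and the flow values have to be read off the contigs by hand.

```python
def repeat_flow_status(repeat_flow, unique_flow, tolerance=0.3):
    """Compare the flow of a repeat-region contig with the unique-region flow.

    A ratio near 2 suggests that two repeat copies were collapsed into one
    contig; a ratio near 1 suggests the copies were kept apart.
    The tolerance is an arbitrary illustrative cut-off.
    """
    if unique_flow <= 0:
        raise ValueError("unique_flow must be positive")
    ratio = repeat_flow / unique_flow
    if abs(ratio - 2.0) <= tolerance:
        return "collapsed"
    if abs(ratio - 1.0) <= tolerance:
        return "separate"
    return "unclear"


# Example with values in the ranges mentioned above.
print(repeat_flow_status(1300, 650))
print(repeat_flow_status(650, 650))
```

With a unique-region flow of 650, a repeat contig at 1300 is reported as "collapsed" and one at 650 as "separate".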

Regarding the supplementary: The script produces full genomes of the strains: These have the base of the reference if not covered and of the contigs otherwise. For that you need the (Haploflow) contigs, a reference genome and a couple of files produced by QUAST: The SNP file, the Coords file and the report containing the Duplication ratio as well as the bam-mapping file. The contigs from Haploflow get clustered based on their flow into a number of clusters based on the Duplication ratio of QUAST (assuming multiple strains and one reference genome). If there is only one strain, the Duplication ratio (as in your report) is ~1, then there will be only one cluster and only one genome is produced. You could theoretically use this method with the contigs from Il_for_contigs and Il_rev_contigs and then should have both ends covered by contigs produced by Haploflow.
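To make the clustering step more concrete, here is a minimal sketch of grouping contigs by flow into as many clusters as the rounded Duplication ratio. It is not the code from the supplementary repository: the function name cluster_by_flow is hypothetical, the input is assumed to be a mapping from contig name to flow, and splitting at the largest gaps between sorted flows is a simple stand-in heuristic.

```python
def cluster_by_flow(flows, duplication_ratio):
    """Group contig names into round(duplication_ratio) clusters by flow.

    flows: dict mapping contig name -> flow value.
    The sorted flows are split at the (k - 1) largest gaps between
    neighbouring values, which is a simple 1-D heuristic.
    """
    if not flows:
        return []
    ordered = sorted(flows.items(), key=lambda item: item[1])
    # Never ask for more clusters than there are contigs.
    k = min(max(1, round(duplication_ratio)), len(ordered))
    # gaps[i] is the flow difference between ordered[i] and ordered[i + 1].
    gaps = [ordered[i + 1][1] - ordered[i][1] for i in range(len(ordered) - 1)]
    largest = sorted(range(len(gaps)), key=lambda i: gaps[i], reverse=True)
    cut_after = sorted(largest[: k - 1])

    clusters, start = [], 0
    for i in cut_after:
        clusters.append([name for name, _ in ordered[start : i + 1]])
        start = i + 1
    clusters.append([name for name, _ in ordered[start:]])
    return clusters


# Made-up flows: two low-flow and two high-flow contigs.
example = {"c1": 100.0, "c2": 110.0, "c3": 400.0, "c4": 420.0}
print(cluster_by_flow(example, 2.03))
print(cluster_by_flow(example, 1.0))
```

With a Duplication ratio of about 1, as in the report discussed here, everything ends up in a single cluster, which matches the one-genome case described above.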

Looking at the Nanopore data, following this up further does not seem promising for your dataset.

If you could provide the raw data I would be happy to have a look and see whether some different parameter set produces better results.

arunvv90 commented 2 years ago

Thanks for the quick reply.

This is real data.

Is the QUAST reference genome exactly the one in the dataset/read set? - We do not know. That is my objective. The reference genome was sequenced in 1992. I have another 20 genomes collected over the last 30 years, which may or may not be the same.

Do other assemblers resolve the middle part in a single contig? - Yes. Canu resolved it as a single contig with nanopore reads, but it collapses in the repeat regions or gives misassemblies.

Supplementary - I can see deletions when raw reads are mapped to the reference and visualized with IGV. I want to keep them as such rather than replacing them with bases from the reference. There are almost 12000 files in coverage and 600 files in graphs. I am not sure which one is which. I hope you can find out after running my raw data.

Reference genome - https://www.ncbi.nlm.nih.gov/nuccore/NC_001493.2

I am attaching the raw data as Google Drive links. Please let me know if you can't access them.
S99-1170_IL_kr_R2_001_val_2.fq.gz https://drive.google.com/file/d/1G60HLZrDqhAIRHIN4ZF2V8pdzU2O9xb4/view?usp=drive_web
S99-1170_IL_kr_R1_001_val_1.fq.gz https://drive.google.com/file/d/1W6ZL1q53edU3kgiO4hdJk3zzgLlQ2DbS/view?usp=drive_web

Arun Venugopalan Ph.D Scholar Infectious Diseases, Basic Science Dept. College of Veterinary Medicine Mississippi State University, USA

arunvv90 commented 2 years ago

Maybe my data is not accessible from the above mail. Please find the links below:
https://drive.google.com/file/d/1G60HLZrDqhAIRHIN4ZF2V8pdzU2O9xb4/view?usp=sharing
https://drive.google.com/file/d/1W6ZL1q53edU3kgiO4hdJk3zzgLlQ2DbS/view?usp=sharing

AlphaSquad commented 2 years ago

Interesting, I will have a look. The number of files is indeed one thing that was reduced between versions 0.1 and 1.0.

AlphaSquad commented 2 years ago

Could you also point me to the reference genome you used for QUAST, so our results look the same?

arunvv90 commented 2 years ago

Link to reference genome - https://www.ncbi.nlm.nih.gov/nuccore/NC_001493.2

Interestingly, I tried different k lengths. Comparing the results shows that the assembly is fragmented. If I manually combine the contigs from the different assemblies, I may get a complete genome. I was wondering about the reason for this fragmentation. Is it repeats? Does the sample contain different strains with so many mutations that the assembler cannot derive a consensus? Is there a way to get a complete genome?

[Attached screenshots: s1, s2, s3]

report.pdf

arunvv90 commented 2 years ago

Sorry, I forgot to mention the filenames. I have used three datasets: forward reads only, reverse reads only, and combined reads. A file name starting with F means the forward-reads-only data set and R means the reverse-reads-only data set. Similarly, C indicates the combined reads. def in the file name indicates that the default parameters were used for running haploflow. K51 in the file name means that k has been set to 51. canu indicates the assembly produced by canu. I have tried k values from the default 41 up to 131.

AlphaSquad commented 2 years ago

The fragmented regions are indicators of repeat structures, yes. I don't know much about the HSV virus structure, but there also seems to be a difficult-to-resolve (repetitive?) region in the middle, which is shorter, enabling canu to resolve it with the longer reads (the bigger k seems to be of only limited help). Looking at the coverage graph, it does not seem to indicate that multiple viral strains are present (the red line denotes the erroneous k-mer threshold, which probably could be increased, but I don't believe that would improve quality much, if at all).

[Attached image: HSV_coverage]

Note that Haploflow was optimised to resolve closely related strains and will generally not produce consensus sequences but strain-resolved contigs for these. Resolving repeats unfortunately poses a quite different challenge since repeats generally cannot be resolved using their coverage. There are probably paths though, which Haploflow is not confident in following. If you like, I could try and increase Haploflow's "greediness" (not a parameter atm) so that Haploflow follows paths even if the relative coverage is around ~50%, but that would probably lead to chimeric contigs (and would take me a moment to implement).

arunvv90 commented 2 years ago

Thanks for the help. Please try the greediness parameter. If you think any other approaches to resolving the repeats would be helpful, please let me know. I know it is very difficult to resolve repeats. I have been chasing this for the last 6 months. I really appreciate your interest in helping me. Here is the paper describing the type strain: https://www.sciencedirect.com/science/article/pii/004268229290056U?via%3Dihub It may help to see the regions of repeats. I am attaching figures of the genome structure. It belongs to the same order as HCMV.

[Attached figures: s4, s5]

AlphaSquad commented 2 years ago

There is an improved long contig mode with commit 3635bc1