openjournals / joss-reviews

Reviews for the Journal of Open Source Software
Creative Commons Zero v1.0 Universal
718 stars 38 forks source link

[REVIEW]: RedOak: a reference-free and alignment-free structure for indexing a collection of similar genomes #4363

Closed editorialbot closed 1 year ago

editorialbot commented 2 years ago

Submitting author: !--author-handle-->@cagret<!--end-author-handle-- (Clement AGRET) Repository: https://gite.lirmm.fr/doccy/RedOak Branch with paper.md (empty if default branch): Version: v0.1.5 Editor: !--editor-->@lpantano<!--end-editor-- Reviewers: @swatimanekar, @samhorsfield96 Archive: 10.6084/m9.figshare.21711767

Status

status

Status badge code:

HTML: <a href="https://joss.theoj.org/papers/1d4506a2b9e02c0d4f2f547673ff25f1"><img src="https://joss.theoj.org/papers/1d4506a2b9e02c0d4f2f547673ff25f1/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/1d4506a2b9e02c0d4f2f547673ff25f1/status.svg)](https://joss.theoj.org/papers/1d4506a2b9e02c0d4f2f547673ff25f1)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@swatimanekar & @samhorsfield96, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lpantano know.

Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest

Checklists

📝 Checklist for @swatimanekar

📝 Checklist for @samhorsfield96

editorialbot commented 2 years ago

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf
editorialbot commented 2 years ago
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1093/bib/bbw089 is OK
- 10.1111/pbi.12499 is OK
- 10.1093/gigascience/giy125 is OK
- 10.1093/nar/gkp492 is OK
- 10.1186/1471-2105-12-242 is OK
- 10.1093/bioinformatics/btr011 is OK
- 10.1137/070685531 is OK

MISSING DOIs

- None

INVALID DOIs

- None
editorialbot commented 2 years ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.93 s (88.2 files/s, 75266.0 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Bourne Shell                     10           4773           5619          29405
m4                                9           1104            323          10470
C++                              17            506           1409           4437
C/C++ Header                     15            604           4059           1872
TeX                              15            268            203           1240
Markdown                          5            499              0           1159
Bourne Again Shell                3            106            223            607
make                              5             98            433            267
Perl                              1             69            147             66
YAML                              2              1              1             41
--------------------------------------------------------------------------------
SUM:                             82           8028          12417          49564
--------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 2 years ago

Wordcount for paper.md is 1353

editorialbot commented 2 years ago

:warning: An error happened when generating the pdf.

samhorsfield96 commented 2 years ago

Review checklist for @samhorsfield96

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

samhorsfield96 commented 2 years ago

@lpantano there was an issue generating the paper pdf. Is there a way of accessing the newest draft? I see there is one available on biorxiv, but I am unsure whether this is viable for review.

swatimanekar commented 2 years ago

configure: error: in /home/ankush/Downloads/RedOak-master': configure: error: No suitable C++ MPI implementation found. Seeconfig.log' for more details

I am getting the above message while configuring the tool

swatimanekar commented 2 years ago

Review checklist for @swatimanekar

Conflict of interest

Code of Conduct

General checks

Functionality

Documentation

Software paper

cagret commented 2 years ago

Hi, you need : A modern, C++11 ready compiler such as g++ version 4.9 or higher or clang version 3.2 or higher. A 64-bit operating system. Either Mac OS X or Linux are currently supported. OpenMPI libGkArrays-MPI To proprely install the tool. RedOak uses the libGkArrays-MPI library, which requires the zlib and openMPI development files to be already installed on your system (on Debian/Ubuntu it corresponds to the packages zlib1g-dev and libopenmpi-dev). (sudo apt install zlib1g-dev && sudo apt install libopenmpi-dev).

samhorsfield96 commented 2 years ago

@cagret Is there a pdf version of the paper available? I saw one on biorxiv but I'm unsure if this is the latest version. However, the paper.md on the redoak repo seems incomplete in comparison.

cagret commented 2 years ago

@samhorsfield96 yes the one on biorxiv is up to date. I don't understand why the pdflatex on main.tex doesn't work. Can you please retry with sudo apt install textlive-full first and pdflatex main.tex ? Thank you, Regards, Clément

cagret commented 2 years ago

i updated the git repo : https://gite.lirmm.fr/doccy/RedOak/-/tree/master/RedOak-paper

samhorsfield96 commented 2 years ago

i updated the git repo : https://gite.lirmm.fr/doccy/RedOak/-/tree/master/RedOak-paper

Unfortunately the in-text referencing hasn't worked. I shall review the latest biorxiv version.

samhorsfield96 commented 2 years ago

@cagret Thank you for writing a great piece of software. The algorithm itself and the explanation of how it works is excellent, and the benchmarking is comprehensive and pushes Redoak to its limits. I have some minor comments below which hopefully explain my decisions on the checklist. Please let me know if anything is not clear.

Summary

Repository and software

Paper (reviewed paper here: https://www.biorxiv.org/content/10.1101/2020.12.19.423583v2.full.pdf)

swatimanekar commented 2 years ago

Why am I getting number of indexed genome = 1 though I have specified three genomes. Command as follows, mpirun -n 9 redoak --genome GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa -k 27 -o output_sam.txt (File size (GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa) is shorten for testing) And output file is as follows (staring few lines)

GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa AAAAAAAAAAAAAAAATTAGCTCAGCC (1): 1 AAAAAAAAAAAAAAAATTCCCACGATG (1): 1 AAAAAAAAAAAAAAATTAGCTCAGCCT (1): 1 AAAAAAAAAAAAAAATTCCCACGATGA (1): 1 AAAAAAAAAAAAAATTAGCTCAGCCTA (1): 1 AAAAAAAAAAAAAATTCCCACGATGAT (1): 1 AAAAAAAAAAAAACCAAAAACATGAGA (1): 1 AAAAAAAAAAAAATTAGCTCAGCCTAG (1): 1

I have checked for all the lines in the output file, the 27-mers are different but ‘(1): 1’ this part is the same for all 27-mers. Can you help me with the understanding of output format?

screenshot as follows

redoak

cagret commented 2 years ago

Hello, maybe the doc isn't clear enought, and thanks you :

" "Examples:\v" " To index a single (compressed fasta) file without any extra information:\v" " --genome gen1.fa.gz\v" " To index several (compressed or not) files without any extra information:\v" " --genome gen2_chr1.fa,gen2_chr2.fa.gz,gen2_chr3.fa\v" " To index several files and provide a genome name (notice that the white" " spaces must be escaped or the whole argument must be enclosed within" " braquets):\v" " --genome My\ Third\ genome:gen3_chr1.fa,gen3_chr2.fa.gz\v" " To index several (fastq) files setting both a genome name and a genome" " version:\v" " --genome Unassembled\ genome:Version\ 0.1.2:sample1.fq,sample2.fq.gz\v" " To collect the sequence headers and length (be aware that it consumes a" " lot of time and memory):\v" " --genome ::+:sample5.fq.gz,sample6.fq.gz\v" " To index several files setting a genome name and a minimum number as well" " as a maximum number of occurrences for k-mers to be indexed (notice that" " the genome version is empty):\v" " --genome Another\ genome::[5,150]:sample3.fq,sample4.fq.gz\v" "

You need to use the option --genome for each genome. If you use the coma, redoak will think is it a same genome with several files.

In your exemple you need to to mpirun -n 9 redoak --genome GLA4_1000.fa --genome GP39_1000.fa --genome DHX2_1000.fa -k 27 -o output_sam.txt

Thank you for your feed back.

Clément

swatimanekar commented 2 years ago

Now it is working fine with "mpirun -n 9 redoak --genome GLA4_1000.fa --genome GP39_1000.fa --genome DHX2_1000.fa -k 27 -o output_sam.txt". It is mentioned in the manuscript that for JellyFish, a small modification of the merge tool implementation and also a program has been created that simulates a parallelization of jobs. If you could provide this programme, it would be much easier for me to compare RedOak and JellyFish.

cagret commented 2 years ago

Sure how can i send you the "hacked version" ?

cagret commented 2 years ago

Using the hacked version : step 1: count the k-mers of the files with the count subcommand (well, no need to do it again, this has not been modified). step 2 : use the merge subcommand with the jellyfish binary dumps as input. The output file will look very similar to the one produced by RedOak (but the k-mers are not ordered).

We just merge the jellyfish dumps and measure the time it takes (the memory consumed is very low). This gives us a reliable way to check the validity of the RedOak outputs. We delete the header lines (starting with #) and sort what is left.

Like "tail -n +9 merged-output-from-jf.csv | sort > sorted-merge-output.csv". (then you just have to do an md5sum or a diff)

Another way to do it, sort without deleting the rows starting with # then do a diff ignoring the rows starting with #.

cagret commented 2 years ago
diff -ru jellyfish-2.2.10.orig/include/jellyfish/text_dumper.hpp jellyfish-2.2.10/include/jellyfish/text_dumper.hpp
--- jellyfish-2.2.10.orig/include/jellyfish/text_dumper.hpp 2018-03-29 21:28:08.000000000 +0200
+++ jellyfish-2.2.10/include/jellyfish/text_dumper.hpp  2019-05-14 12:57:23.647366563 +0200
@@ -26,6 +26,15 @@
   void write(std::ostream& out, const Key& key, const Val val) {
    out << key << " " << val << "\n";
   }
+  void write(std::ostream& out, const Key& key, const Val val, const std::vector<bool> presence) {
+    out << key << " (" << val << "):";
+    for (std::vector<bool>::const_iterator it = presence.begin();
+         it != presence.end();
+         ++it) {
+      out << " " << *it;
+    }
+    out << "\n";    
+  }
 };

 template<typename storage_t>
@@ -35,6 +44,7 @@

 public:
   static const char* format;
+  static const char* mergeformat;

   text_dumper(int nb_threads, const char* file_prefix, file_header* header = 0) :
     super(nb_threads, file_prefix, header)
@@ -54,6 +64,8 @@
 };
 template<typename storage_t>
 const char* jellyfish::text_dumper<storage_t>::format = "text/sorted";
+template<typename storage_t>
+const char* jellyfish::text_dumper<storage_t>::mergeformat = "text/merged";

 template<typename Key, typename Val>
 class text_reader {
diff -ru jellyfish-2.2.10.orig/jellyfish/merge_files.cc jellyfish-2.2.10/jellyfish/merge_files.cc
--- jellyfish-2.2.10.orig/jellyfish/merge_files.cc  2018-03-29 21:28:08.000000000 +0200
+++ jellyfish-2.2.10/jellyfish/merge_files.cc   2019-05-14 13:00:17.719359425 +0200
@@ -50,12 +50,24 @@
 typedef std::unique_ptr<RectangularBinaryMatrix> matrix_ptr;

 template<typename reader_type, typename writer_type>
+void do_write(writer_type& writer, std::ostream &out, const mer_dna &key, uint64_t sum, const std::vector<bool>& presence) {
+  writer.write(out, key, sum);
+}
+
+template<>
+void do_write<binary_reader, text_writer>(text_writer& writer, std::ostream &out, const mer_dna &key, uint64_t sum, const std::vector<bool>& presence) {
+  writer.write(out, key, sum, presence);
+}
+
+template<typename reader_type, typename writer_type>
 void do_merge(cpp_array<file_info>& files, std::ostream& out, writer_type& writer,
               uint64_t min, uint64_t max) {
   cpp_array<reader_type> readers(files.size());
   typedef jellyfish::mer_heap::heap<mer_dna, reader_type> heap_type;
   typedef typename heap_type::const_item_t heap_item;
   heap_type heap(files.size());
+  std::vector<bool> presence;
+  presence.resize(files.size(), false);

   for(size_t i = 0; i < files.size(); ++i) {
     readers.init(i, files[i].is, &files[i].header);
@@ -66,17 +78,19 @@
   heap_item head = heap.head();
   mer_dna   key;
   while(heap.is_not_empty()) {
+    std::fill(presence.begin(), presence.end(), false);
     key = head->key_;
     uint64_t sum = 0;
     do {
       sum += head->val_;
+      presence[std::distance(&readers[0], head->it_)] = true;
       heap.pop();
       if(head->it_->next())
         heap.push(*head->it_);
       head = heap.head();
     } while(head->key_ == key && heap.is_not_empty());
     if(sum >= min && sum <= max)
-      writer.write(out, key, sum);
+      do_write<reader_type, writer_type>(writer, out, key, sum, presence);
   }
 }

@@ -133,6 +147,7 @@
   std::ofstream out(out_file);
   if(!out.good())
     throw MergeError(err::msg() << "Can't open out file '" << out_file << "'");
+  format = "text/merged";
   out_header.format(format);

   if(!format.compare(binary_dumper::format)) {
@@ -144,6 +159,24 @@
     out_header.write(out);
     text_writer writer;
     do_merge<text_reader, text_writer>(files, out, writer, min, max);
+  } else if(!format.compare(text_dumper::mergeformat)) {
+      out << "# hostname: '" << out_header["hostname"] << "'\n"
+          << "# Execution time: " << out_header["time"] << "\n"
+          << "# Working Directory: '" << out_header["pwd"] << "'\n"
+          << "# Program: '" << out_header["exe_path"] << "'\n"
+          << "# CmdLine: '";
+      std::vector<std::string> cmds = out_header.cmdline();
+      for (std::vector<std::string>::const_iterator it = cmds.begin();
+           it != cmds.end();
+           ++it) {
+        out << " " << *it;
+      }
+      out << "'\n"
+          << "# Canonical: " << (out_header.canonical() ? "true" : "false") << "\n"
+          << "# Format: '" << out_header["format"] << "'\n"
+          << "# Key_len: " << out_header.key_len() << "'\n";
+    text_writer writer;
+    do_merge<binary_reader, text_writer>(files, out, writer, min, max);
   } else {
     throw MergeError(err::msg() << "Unknown format '" << format << "'");
   }
Seulement dans jellyfish-2.2.10/jellyfish: merge_files.cc~
swatimanekar commented 2 years ago

I have used the following commands (for two genomes) ./jellyfish count -m 27 -C -s 300M -t 9 -L 2 -o redoak_27_GP39 dataset/GP39.fa ./jellyfish count -m 27 -C -s 300M -t 9 -L 2 -o redoak_27_DHX2 dataset/DHX2.fa ./jellyfish merge -o mer_counts_merged.jf redoak_27_GP39 redoak_27_DHX2 May you assist me with such command lines to get the output in the desired format?

cagret commented 2 years ago

Did you change the source code of JellyFish ? By change i mean add the overload write function.

swatimanekar commented 2 years ago

no

cagret commented 2 years ago

It will be much easier if you make the changes I sent you above.

swatimanekar commented 2 years ago

Please find an attachement for the comments, Review_comments_final_RedOak.docx

lpantano commented 2 years ago

@cagret, any comments for @swatimanekar ? @swatimanekar could you update your checklist please? It would be easier for me to help to move this forward.

@samhorsfield96, do you mind to take a look at your checklist and makes comment on what is needed for the missing point, so the author can make changes?

@cagret, could you update the paper so the reviewers can use the latest version for the review.

Thanks everybody.

samhorsfield96 commented 2 years ago

Hi @lpantano, the summary of my review is in the comments above (https://github.com/openjournals/joss-reviews/issues/4363#issuecomment-1142333731), apologies if I didn't make this clear when I added the comment. I've made sure my checklist is in line with my summary. Please let me know if there is anything else you need from me.

lpantano commented 2 years ago

Thank you @samhorsfield96. I will let @cagret and co-authors to work on them.

cagret commented 2 years ago

Hi @lpantano, the summary of my review is in the comments above (#4363 (comment)), apologies if I didn't make this clear when I added the comment. I've made sure my checklist is in line with my summary. Please let me know if there is anything else you need from me.

Hi, sorry for the delay, I thought of forwarding the mail but not to answer it. Thank for the feedback,

Paper (reviewed paper here: https://www.biorxiv.org/content/10.1101/2020.12.19.423583v2.full.pdf) TODO

samhorsfield96 commented 2 years ago

Hi @cagret, many thanks for your reply.

swatimanekar commented 2 years ago

@lpantano I have updated the checklist

lpantano commented 2 years ago

Hi @cagret, I will wait for you to address the comments from @samhorsfield96. It seems it needs the statement of need to be added to the paper text. I agree that is more about the need of the tool to answer a scientific question. As well, @swatimanekar is missing more documentation to get to a better level? can you confirm this @swatimanekar? if yes, can you point again to some examples where it can be improved. Thanks

swatimanekar commented 2 years ago

Yes. All those I have mentioned in my comments (commented on Jul 14).

lpantano commented 2 years ago

Hi @cagret, do you have an idea on when you can work on the requests from the reviewers? it will give them an idea to be on top when they are done and try to speed up this. Thanks

cagret commented 2 years ago

@swatimanekar I fixed all the review from the docx, except the square which indicates the end of the theorem. I didI did not regenerate the pdf. For the main review i don't understand "Given the non-dedicated tools like jellyfish and BFT, it's not completely clear to me that the time and memory as resource comparisons are fair", I only see time and RAM to compare tools that only use time and memory. Knowing that we have fixed the number of CPU, I'm not sure what else to do. Thank you very much for the feedback.

@samhorsfield96 I have made changes on the statement of need. Tell me if it's ok now.

@lpantano I think it's ok

Thank you very much everyone

lpantano commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:warning: An error happened when generating the pdf.

lpantano commented 2 years ago

@cagret can you check the error during compilation to make sure the document can be rendered. Or I can try to get help here from JOSS team. thanks

cagret commented 2 years ago

Latexmk: If appropriate, the -f option can be used to get latexmk to try to force complete processing. [makePDF] Run #1 Rc files read: NONE Latexmk: Run number 1 of rule 'lualatex' This is LuaHBTeX, Version 1.15.0 (TeX Live 2022) restricted system commands enabled. Latexmk: Getting log file '/tmp/tex2pdf.-00f6c558cb30663e/input.log' Collected error summary (may duplicate other messages): lualatex: Command for 'lualatex' gave return code 1 Refer to '/tmp/tex2pdf.-00f6c558cb30663e/input.log' for details

Error producing PDF. ! Missing $ inserted.

$ l.540 reduce the final score by \frac{l-k}{l}
cagret commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:warning: An error happened when generating the pdf.

cagret commented 2 years ago

@editorialbot generate pdf

editorialbot commented 2 years ago

:warning: An error happened when generating the pdf.

cagret commented 2 years ago

@editorialbot commands

editorialbot commented 2 years ago

Hello @cagret, here are the things you can ask me to do:


# List all available commands
@editorialbot commands

# Get a list of all editors's GitHub handles
@editorialbot list editors

# Check the references of the paper for missing DOIs
@editorialbot check references

# Perform checks on the repository
@editorialbot check repository

# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist

# Set a value for branch
@editorialbot set joss-paper as branch

# Generates the pdf paper
@editorialbot generate pdf

# Get a link to the complete list of reviewers
@editorialbot list reviewers
cagret commented 2 years ago

@editorialbot check repository

editorialbot commented 2 years ago
Software report:

github.com/AlDanial/cloc v 1.88  T=0.21 s (406.4 files/s, 339036.7 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Bourne Shell                     10           4773           5619          29405
m4                                9           1104            323          10470
C++                              17            506           1409           4437
C/C++ Header                     15            604           4059           1872
TeX                              16            278            224           1244
Markdown                          5            498              0           1160
Bourne Again Shell                3            106            223            607
make                              5             98            433            267
Perl                              1             69            147             66
YAML                              3              3              5             59
--------------------------------------------------------------------------------
SUM:                             84           8039          12442          49587
--------------------------------------------------------------------------------

gitinspector failed to run statistical information for the repository
editorialbot commented 2 years ago

Wordcount for paper.md is 1399

cagret commented 2 years ago

@editorialbot generate pdf