Closed editorialbot closed 1 year ago
Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.
For a list of things I can do to help you, just type:
@editorialbot commands
For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:
@editorialbot generate pdf
Reference check summary (note 'MISSING' DOIs are suggestions that need verification):
OK DOIs
- 10.1093/bib/bbw089 is OK
- 10.1111/pbi.12499 is OK
- 10.1093/gigascience/giy125 is OK
- 10.1093/nar/gkp492 is OK
- 10.1186/1471-2105-12-242 is OK
- 10.1093/bioinformatics/btr011 is OK
- 10.1137/070685531 is OK
MISSING DOIs
- None
INVALID DOIs
- None
Software report:
github.com/AlDanial/cloc v 1.88 T=0.93 s (88.2 files/s, 75266.0 lines/s)
--------------------------------------------------------------------------------
Language files blank comment code
--------------------------------------------------------------------------------
Bourne Shell 10 4773 5619 29405
m4 9 1104 323 10470
C++ 17 506 1409 4437
C/C++ Header 15 604 4059 1872
TeX 15 268 203 1240
Markdown 5 499 0 1159
Bourne Again Shell 3 106 223 607
make 5 98 433 267
Perl 1 69 147 66
YAML 2 1 1 41
--------------------------------------------------------------------------------
SUM: 82 8028 12417 49564
--------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md
is 1353
:warning: An error happened when generating the pdf.
@lpantano there was an issue generating the paper pdf. Is there a way of accessing the newest draft? I see there is one available on biorxiv, but I am unsure whether this is viable for review.
configure: error: in /home/ankush/Downloads/RedOak-master': configure: error: No suitable C++ MPI implementation found. See
config.log' for more details
I am getting the above message while configuring the tool
Hi, you need : A modern, C++11 ready compiler such as g++ version 4.9 or higher or clang version 3.2 or higher. A 64-bit operating system. Either Mac OS X or Linux are currently supported. OpenMPI libGkArrays-MPI To proprely install the tool. RedOak uses the libGkArrays-MPI library, which requires the zlib and openMPI development files to be already installed on your system (on Debian/Ubuntu it corresponds to the packages zlib1g-dev and libopenmpi-dev). (sudo apt install zlib1g-dev && sudo apt install libopenmpi-dev).
@cagret Is there a pdf version of the paper available? I saw one on biorxiv but I'm unsure if this is the latest version. However, the paper.md on the redoak repo seems incomplete in comparison.
@samhorsfield96 yes the one on biorxiv is up to date. I don't understand why the pdflatex on main.tex doesn't work. Can you please retry with sudo apt install textlive-full first and pdflatex main.tex ? Thank you, Regards, Clément
i updated the git repo : https://gite.lirmm.fr/doccy/RedOak/-/tree/master/RedOak-paper
i updated the git repo : https://gite.lirmm.fr/doccy/RedOak/-/tree/master/RedOak-paper
Unfortunately the in-text referencing hasn't worked. I shall review the latest biorxiv version.
@cagret Thank you for writing a great piece of software. The algorithm itself and the explanation of how it works is excellent, and the benchmarking is comprehensive and pushes Redoak to its limits. I have some minor comments below which hopefully explain my decisions on the checklist. Please let me know if anything is not clear.
Summary
Repository and software
Paper (reviewed paper here: https://www.biorxiv.org/content/10.1101/2020.12.19.423583v2.full.pdf)
Why am I getting number of indexed genome = 1 though I have specified three genomes. Command as follows, mpirun -n 9 redoak --genome GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa -k 27 -o output_sam.txt (File size (GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa) is shorten for testing) And output file is as follows (staring few lines)
GLA4_1000.fa,GP39_1000.fa,DHX2_1000.fa AAAAAAAAAAAAAAAATTAGCTCAGCC (1): 1 AAAAAAAAAAAAAAAATTCCCACGATG (1): 1 AAAAAAAAAAAAAAATTAGCTCAGCCT (1): 1 AAAAAAAAAAAAAAATTCCCACGATGA (1): 1 AAAAAAAAAAAAAATTAGCTCAGCCTA (1): 1 AAAAAAAAAAAAAATTCCCACGATGAT (1): 1 AAAAAAAAAAAAACCAAAAACATGAGA (1): 1 AAAAAAAAAAAAATTAGCTCAGCCTAG (1): 1
I have checked for all the lines in the output file, the 27-mers are different but ‘(1): 1’ this part is the same for all 27-mers. Can you help me with the understanding of output format?
screenshot as follows
Hello, maybe the doc isn't clear enought, and thanks you :
" "Examples:\v" " To index a single (compressed fasta) file without any extra information:\v" " --genome gen1.fa.gz\v" " To index several (compressed or not) files without any extra information:\v" " --genome gen2_chr1.fa,gen2_chr2.fa.gz,gen2_chr3.fa\v" " To index several files and provide a genome name (notice that the white" " spaces must be escaped or the whole argument must be enclosed within" " braquets):\v" " --genome My\ Third\ genome:gen3_chr1.fa,gen3_chr2.fa.gz\v" " To index several (fastq) files setting both a genome name and a genome" " version:\v" " --genome Unassembled\ genome:Version\ 0.1.2:sample1.fq,sample2.fq.gz\v" " To collect the sequence headers and length (be aware that it consumes a" " lot of time and memory):\v" " --genome ::+:sample5.fq.gz,sample6.fq.gz\v" " To index several files setting a genome name and a minimum number as well" " as a maximum number of occurrences for k-mers to be indexed (notice that" " the genome version is empty):\v" " --genome Another\ genome::[5,150]:sample3.fq,sample4.fq.gz\v" "
You need to use the option --genome for each genome. If you use the coma, redoak will think is it a same genome with several files.
In your exemple you need to to mpirun -n 9 redoak --genome GLA4_1000.fa --genome GP39_1000.fa --genome DHX2_1000.fa -k 27 -o output_sam.txt
Thank you for your feed back.
Clément
Now it is working fine with "mpirun -n 9 redoak --genome GLA4_1000.fa --genome GP39_1000.fa --genome DHX2_1000.fa -k 27 -o output_sam.txt". It is mentioned in the manuscript that for JellyFish, a small modification of the merge tool implementation and also a program has been created that simulates a parallelization of jobs. If you could provide this programme, it would be much easier for me to compare RedOak and JellyFish.
Sure how can i send you the "hacked version" ?
Using the hacked version : step 1: count the k-mers of the files with the count subcommand (well, no need to do it again, this has not been modified). step 2 : use the merge subcommand with the jellyfish binary dumps as input. The output file will look very similar to the one produced by RedOak (but the k-mers are not ordered).
We just merge the jellyfish dumps and measure the time it takes (the memory consumed is very low). This gives us a reliable way to check the validity of the RedOak outputs. We delete the header lines (starting with #) and sort what is left.
Like "tail -n +9 merged-output-from-jf.csv | sort > sorted-merge-output.csv". (then you just have to do an md5sum or a diff)
Another way to do it, sort without deleting the rows starting with # then do a diff ignoring the rows starting with #.
diff -ru jellyfish-2.2.10.orig/include/jellyfish/text_dumper.hpp jellyfish-2.2.10/include/jellyfish/text_dumper.hpp
--- jellyfish-2.2.10.orig/include/jellyfish/text_dumper.hpp 2018-03-29 21:28:08.000000000 +0200
+++ jellyfish-2.2.10/include/jellyfish/text_dumper.hpp 2019-05-14 12:57:23.647366563 +0200
@@ -26,6 +26,15 @@
void write(std::ostream& out, const Key& key, const Val val) {
out << key << " " << val << "\n";
}
+ void write(std::ostream& out, const Key& key, const Val val, const std::vector<bool> presence) {
+ out << key << " (" << val << "):";
+ for (std::vector<bool>::const_iterator it = presence.begin();
+ it != presence.end();
+ ++it) {
+ out << " " << *it;
+ }
+ out << "\n";
+ }
};
template<typename storage_t>
@@ -35,6 +44,7 @@
public:
static const char* format;
+ static const char* mergeformat;
text_dumper(int nb_threads, const char* file_prefix, file_header* header = 0) :
super(nb_threads, file_prefix, header)
@@ -54,6 +64,8 @@
};
template<typename storage_t>
const char* jellyfish::text_dumper<storage_t>::format = "text/sorted";
+template<typename storage_t>
+const char* jellyfish::text_dumper<storage_t>::mergeformat = "text/merged";
template<typename Key, typename Val>
class text_reader {
diff -ru jellyfish-2.2.10.orig/jellyfish/merge_files.cc jellyfish-2.2.10/jellyfish/merge_files.cc
--- jellyfish-2.2.10.orig/jellyfish/merge_files.cc 2018-03-29 21:28:08.000000000 +0200
+++ jellyfish-2.2.10/jellyfish/merge_files.cc 2019-05-14 13:00:17.719359425 +0200
@@ -50,12 +50,24 @@
typedef std::unique_ptr<RectangularBinaryMatrix> matrix_ptr;
template<typename reader_type, typename writer_type>
+void do_write(writer_type& writer, std::ostream &out, const mer_dna &key, uint64_t sum, const std::vector<bool>& presence) {
+ writer.write(out, key, sum);
+}
+
+template<>
+void do_write<binary_reader, text_writer>(text_writer& writer, std::ostream &out, const mer_dna &key, uint64_t sum, const std::vector<bool>& presence) {
+ writer.write(out, key, sum, presence);
+}
+
+template<typename reader_type, typename writer_type>
void do_merge(cpp_array<file_info>& files, std::ostream& out, writer_type& writer,
uint64_t min, uint64_t max) {
cpp_array<reader_type> readers(files.size());
typedef jellyfish::mer_heap::heap<mer_dna, reader_type> heap_type;
typedef typename heap_type::const_item_t heap_item;
heap_type heap(files.size());
+ std::vector<bool> presence;
+ presence.resize(files.size(), false);
for(size_t i = 0; i < files.size(); ++i) {
readers.init(i, files[i].is, &files[i].header);
@@ -66,17 +78,19 @@
heap_item head = heap.head();
mer_dna key;
while(heap.is_not_empty()) {
+ std::fill(presence.begin(), presence.end(), false);
key = head->key_;
uint64_t sum = 0;
do {
sum += head->val_;
+ presence[std::distance(&readers[0], head->it_)] = true;
heap.pop();
if(head->it_->next())
heap.push(*head->it_);
head = heap.head();
} while(head->key_ == key && heap.is_not_empty());
if(sum >= min && sum <= max)
- writer.write(out, key, sum);
+ do_write<reader_type, writer_type>(writer, out, key, sum, presence);
}
}
@@ -133,6 +147,7 @@
std::ofstream out(out_file);
if(!out.good())
throw MergeError(err::msg() << "Can't open out file '" << out_file << "'");
+ format = "text/merged";
out_header.format(format);
if(!format.compare(binary_dumper::format)) {
@@ -144,6 +159,24 @@
out_header.write(out);
text_writer writer;
do_merge<text_reader, text_writer>(files, out, writer, min, max);
+ } else if(!format.compare(text_dumper::mergeformat)) {
+ out << "# hostname: '" << out_header["hostname"] << "'\n"
+ << "# Execution time: " << out_header["time"] << "\n"
+ << "# Working Directory: '" << out_header["pwd"] << "'\n"
+ << "# Program: '" << out_header["exe_path"] << "'\n"
+ << "# CmdLine: '";
+ std::vector<std::string> cmds = out_header.cmdline();
+ for (std::vector<std::string>::const_iterator it = cmds.begin();
+ it != cmds.end();
+ ++it) {
+ out << " " << *it;
+ }
+ out << "'\n"
+ << "# Canonical: " << (out_header.canonical() ? "true" : "false") << "\n"
+ << "# Format: '" << out_header["format"] << "'\n"
+ << "# Key_len: " << out_header.key_len() << "'\n";
+ text_writer writer;
+ do_merge<binary_reader, text_writer>(files, out, writer, min, max);
} else {
throw MergeError(err::msg() << "Unknown format '" << format << "'");
}
Seulement dans jellyfish-2.2.10/jellyfish: merge_files.cc~
I have used the following commands (for two genomes) ./jellyfish count -m 27 -C -s 300M -t 9 -L 2 -o redoak_27_GP39 dataset/GP39.fa ./jellyfish count -m 27 -C -s 300M -t 9 -L 2 -o redoak_27_DHX2 dataset/DHX2.fa ./jellyfish merge -o mer_counts_merged.jf redoak_27_GP39 redoak_27_DHX2 May you assist me with such command lines to get the output in the desired format?
Did you change the source code of JellyFish ? By change i mean add the overload write function.
no
It will be much easier if you make the changes I sent you above.
Please find an attachement for the comments, Review_comments_final_RedOak.docx
@cagret, any comments for @swatimanekar ? @swatimanekar could you update your checklist please? It would be easier for me to help to move this forward.
@samhorsfield96, do you mind to take a look at your checklist and makes comment on what is needed for the missing point, so the author can make changes?
@cagret, could you update the paper so the reviewers can use the latest version for the review.
Thanks everybody.
Hi @lpantano, the summary of my review is in the comments above (https://github.com/openjournals/joss-reviews/issues/4363#issuecomment-1142333731), apologies if I didn't make this clear when I added the comment. I've made sure my checklist is in line with my summary. Please let me know if there is anything else you need from me.
Thank you @samhorsfield96. I will let @cagret and co-authors to work on them.
Hi @lpantano, the summary of my review is in the comments above (#4363 (comment)), apologies if I didn't make this clear when I added the comment. I've made sure my checklist is in line with my summary. Please let me know if there is anything else you need from me.
Hi, sorry for the delay, I thought of forwarding the mail but not to answer it. Thank for the feedback,
The "Statement of need" is in the readme, RedOak requires: A modern, C++11 ready compiler such as g++ version 4.9 or higher or clang version 3.2 or higher. A 64-bit operating system. Either Mac OS X or Linux are currently supported. OpenMPI libGkArrays-MPI Doxygen (recommanded but not mandatory)
We made the choice not to encapsulate our tool with Conda or Docker or other to have an optimal use of the available resources. This implies that the user must master the SGE or SLURM environment or at least MPI.
For the "mpirun -n 10 redoak ../test/test-0025_reads.fastq 25" i fixed this by ~/local_install/bin/mpirun -n 10 ~/local_install/bin/redoak --kmer 25 --genome resources/genome_ex1.fasta (thanks)
For the test on the 128 genomes : Is there a link so I can download the 128 genomes and try the indexing with RedOak? I did not find the link in https://pubmed.ncbi.nlm.nih.gov/27998823/
The output of query results goes to STDOUT use ">" to redirect into a file.
Paper (reviewed paper here: https://www.biorxiv.org/content/10.1101/2020.12.19.423583v2.full.pdf) TODO
Hi @cagret, many thanks for your reply.
@lpantano I have updated the checklist
Hi @cagret, I will wait for you to address the comments from @samhorsfield96. It seems it needs the statement of need to be added to the paper text. I agree that is more about the need of the tool to answer a scientific question. As well, @swatimanekar is missing more documentation to get to a better level? can you confirm this @swatimanekar? if yes, can you point again to some examples where it can be improved. Thanks
Yes. All those I have mentioned in my comments (commented on Jul 14).
Hi @cagret, do you have an idea on when you can work on the requests from the reviewers? it will give them an idea to be on top when they are done and try to speed up this. Thanks
@swatimanekar I fixed all the review from the docx, except the square which indicates the end of the theorem. I didI did not regenerate the pdf. For the main review i don't understand "Given the non-dedicated tools like jellyfish and BFT, it's not completely clear to me that the time and memory as resource comparisons are fair", I only see time and RAM to compare tools that only use time and memory. Knowing that we have fixed the number of CPU, I'm not sure what else to do. Thank you very much for the feedback.
@samhorsfield96 I have made changes on the statement of need. Tell me if it's ok now.
@lpantano I think it's ok
Thank you very much everyone
@editorialbot generate pdf
:warning: An error happened when generating the pdf.
@cagret can you check the error during compilation to make sure the document can be rendered. Or I can try to get help here from JOSS team. thanks
Latexmk: If appropriate, the -f option can be used to get latexmk to try to force complete processing. [makePDF] Run #1 Rc files read: NONE Latexmk: Run number 1 of rule 'lualatex' This is LuaHBTeX, Version 1.15.0 (TeX Live 2022) restricted system commands enabled. Latexmk: Getting log file '/tmp/tex2pdf.-00f6c558cb30663e/input.log' Collected error summary (may duplicate other messages): lualatex: Command for 'lualatex' gave return code 1 Refer to '/tmp/tex2pdf.-00f6c558cb30663e/input.log' for details
Error producing PDF. ! Missing $ inserted.
@editorialbot generate pdf
:warning: An error happened when generating the pdf.
@editorialbot generate pdf
:warning: An error happened when generating the pdf.
@editorialbot commands
Hello @cagret, here are the things you can ask me to do:
# List all available commands
@editorialbot commands
# Get a list of all editors's GitHub handles
@editorialbot list editors
# Check the references of the paper for missing DOIs
@editorialbot check references
# Perform checks on the repository
@editorialbot check repository
# Adds a checklist for the reviewer using this command
@editorialbot generate my checklist
# Set a value for branch
@editorialbot set joss-paper as branch
# Generates the pdf paper
@editorialbot generate pdf
# Get a link to the complete list of reviewers
@editorialbot list reviewers
@editorialbot check repository
Software report:
github.com/AlDanial/cloc v 1.88 T=0.21 s (406.4 files/s, 339036.7 lines/s)
--------------------------------------------------------------------------------
Language files blank comment code
--------------------------------------------------------------------------------
Bourne Shell 10 4773 5619 29405
m4 9 1104 323 10470
C++ 17 506 1409 4437
C/C++ Header 15 604 4059 1872
TeX 16 278 224 1244
Markdown 5 498 0 1160
Bourne Again Shell 3 106 223 607
make 5 98 433 267
Perl 1 69 147 66
YAML 3 3 5 59
--------------------------------------------------------------------------------
SUM: 84 8039 12442 49587
--------------------------------------------------------------------------------
gitinspector failed to run statistical information for the repository
Wordcount for paper.md
is 1399
@editorialbot generate pdf
Submitting author: !--author-handle-->@cagret<!--end-author-handle-- (Clement AGRET) Repository: https://gite.lirmm.fr/doccy/RedOak Branch with paper.md (empty if default branch): Version: v0.1.5 Editor: !--editor-->@lpantano<!--end-editor-- Reviewers: @swatimanekar, @samhorsfield96 Archive: 10.6084/m9.figshare.21711767
Status
Status badge code:
Reviewers and authors:
Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)
Reviewer instructions & questions
@swatimanekar & @samhorsfield96, your review will be checklist based. Each of you will have a separate checklist that you should update when carrying out your review. First of all you need to run this command in a separate comment to create the checklist:
The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @lpantano know.
✨ Please start on your review when you are able, and be sure to complete your review in the next six weeks, at the very latest ✨
Checklists
📝 Checklist for @swatimanekar
📝 Checklist for @samhorsfield96