Sample cmdlines

heshpdx commented 2 years ago

I am a computer architect, interested in benchmarking workloads for the sake of comparing different CPUs. I have used BLAST in my workload set in the past, but I would like to upgrade to what researchers in your field use these days, and that is how I came across VSEARCH. I feel VSEARCH seems to be a good compute-bound application that I could tune our future CPUs on.

I was able to download and build it for my architecture, and I also downloaded some FASTA files from this repository and others. However, I am not qualified to craft realistic command lines that people would use in real life (I can make something up, but I would be flying blind). The documentation is very extensive, but I cannot understand the meaning behind all the searches and transformations. Can you share some command lines that are relevant and realistic?

If you would like to get more involved in benchmark creation, there are some monetary rewards possible. Thanks!

frederic-mahe commented 2 years ago

hello @heshpdx

What environment do you use to run your benchmarks? I usually create and test synthetic datasets with GNU tools and bash.

heshpdx commented 2 years ago

GNU tools and bash are great. We use gcc on many architectures under Linux. I would prefer real-world datasets as opposed to synthetic ones. For example, finding certain protein sequences in a mouse genome (is that even a good example?) is better than a random sequence which is then spliced or clustered.

On Sep 20, 2022, 1:11 AM, at 1:11 AM, "Frédéric Mahé" @.***> wrote:

hello @heshpdx

What environment do you use to run your benchmarks? I usually create and test synthetic datasets with GNU tools and bash.

-- Reply to this email directly or view it on GitHub: https://github.com/torognes/vsearch-data/issues/5#issuecomment-1251995469 You are receiving this because you were mentioned.

Message ID: @.***>

colinbrislawn commented 2 years ago

I'm a user, not a dev, but here's a command to get you started:

vsearch --allpairs_global <a fasta or fsa.gz file> --uc output_hits.uc --acceptall --threads <whatever>

Notes

If you don't need the large output file, add the flag --top_hits_only so the .uc file only lists the best hits.
If you use --threads 1 or sort the output_hits.uc file, the output is deterministic.
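The second note can be turned into a simple check: sort each result file before fingerprinting it, so that two runs count as equal when they contain the same lines in any order. This is only a sketch. The function name `sorted_checksum` is made up for illustration, and the two demo files hold placeholder lines, not real .uc records.

```shell
#!/usr/bin/env bash
# Sketch: order-independent fingerprint of a line-oriented result file.
# Assumption: one hit per line, so sorting the lines removes any ordering
# differences introduced by multi-threaded runs.
sorted_checksum() {
    LC_ALL=C sort "$1" | cksum | cut -d ' ' -f 1
}

# Demo with two stand-in files holding the same lines in a different order.
# These are placeholder lines, not real vsearch output.
tmpdir=$(mktemp -d)
printf 'H\tq1\tt1\nH\tq2\tt2\nH\tq3\tt3\n' > "$tmpdir/run1.uc"
printf 'H\tq3\tt3\nH\tq1\tt1\nH\tq2\tt2\n' > "$tmpdir/run2.uc"

if [ "$(sorted_checksum "$tmpdir/run1.uc")" = "$(sorted_checksum "$tmpdir/run2.uc")" ]; then
    echo "same hits"
else
    echo "different hits"
fi
rm -rf "$tmpdir"
```

For a benchmark, comparing the sorted fingerprint against a stored reference value is one way to validate multi-threaded runs without forcing --threads 1.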

VSEARCH includes a ton of utility functions, but the killer feature is fast, exact alignment. That's why I'm suggesting --allpairs_global: it aligns each sequence in the fasta file with every other sequence, so it's O(N²).

The smallest fasta in this repo is AF091148.fsa, which is a real 16S gene. If that's too small / fast for your benchmark, you can pick a larger fasta file.
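To judge how much work a given file implies before running it, one can count its sequences and derive the number of pairs. The sketch below assumes each unordered pair is aligned once, giving N(N-1)/2 alignments. The helper names `count_seqs` and `count_pairs` are invented for this example, and the demo file is a tiny stand-in.

```shell
#!/usr/bin/env bash
# Count FASTA records: every record starts with a '>' header line.
count_seqs() {
    grep -c '^>' "$1"
}

# Number of unordered pairs among n sequences: n * (n - 1) / 2.
# Assumption: each pair is aligned once, not once per direction.
count_pairs() {
    local n
    n=$(count_seqs "$1")
    echo $(( n * (n - 1) / 2 ))
}

# Demo on a tiny stand-in FASTA file with four records.
demo=$(mktemp)
printf '>s1\nACGT\n>s2\nACGA\n>s3\nACGG\n>s4\nACGC\n' > "$demo"
echo "$(count_seqs "$demo") sequences, $(count_pairs "$demo") pairs"
# prints "4 sequences, 6 pairs"
rm -f "$demo"
```

Because the pair count grows quadratically, doubling the number of sequences roughly quadruples the alignment work, which is useful when picking a file that fits a target run time.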

protein sequences in a mouse genome (is that even a good example?)

Close!

I use VSEARCH for short, highly similar nucleotide sequences all from the same gene. This is like matching a fingerprint to a database of other fingerprints.

For matching more divergent nucleotide sequences against a full genome, check out minimap2.

For protein sequence against a mouse genome, as in your example, check out DIAMOND.

heshpdx commented 2 years ago

Thanks so much for the tips! I downloaded some files from GPM, NIH, and also this repo. I tried to take some small and some medium-sized files (the large ones run for quite a while).

$ ls -lgSG *fsa *fasta
-rw-rw-r-- 1 266164011 Sep 23 20:28 swissprot.fasta
-rw-rw-r-- 1  16391165 Sep 23 20:28 ipi.BOVIN.fasta
-rw-rw-r-- 1   6196283 Sep 23 20:28 Toxoplasma_gondii.fasta
-rw-rw-r-- 1   1240669 Sep 23 20:28 human_virus.fasta
-rw-rw-r-- 1    440854 Sep 23 20:28 Rfam_11_0.repr.fasta
-rw-rw-r-- 1    229809 Sep 23 20:28 AF091148.fsa

I have crafted these command lines, which total about 3 minutes of execution in single threaded mode, and of course are much quicker when using all cores. Please let me know if I am doing anything silly.

--allpairs_global AF091148.fsa --acceptall --top_hits_only --uc af.out.txt
--allpairs_global Rfam_11_0.repr.fasta --id 0.95 --top_hits_only --uc rfam.out.txt
--usearch_global  Toxoplasma_gondii.fasta --db swissprot.fasta --id 0.92 --alnout plasma1.out.txt
--usearch_global  human_virus.fasta --db swissprot.fasta --id 0.93 --top_hits_only --uc virus.out.txt
--cluster_size    ipi.BOVIN.fasta --id 0.9 --centroids bovin.out.txt
--orient          ipi.BOVIN.fasta --db swissprot.fasta --fastaout bovin.orient.txt

I tried cluster_size on swissprot.fasta and it was taking a long time while consuming 10+GB of memory. Maybe it's a good workload for a server processor!
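A minimal way to reproduce single-threaded timings like the ones mentioned above is a small wrapper around each command line. This is a sketch only: `elapsed_seconds` is a hypothetical helper, it relies on bash's built-in SECONDS counter (one-second resolution), and the real run is attempted only if vsearch and AF091148.fsa happen to be present in the current directory.

```shell
#!/usr/bin/env bash
# Run a command, discard its output, and print elapsed wall-clock seconds.
# Uses bash's SECONDS counter, so the resolution is one second.
elapsed_seconds() {
    local start=$SECONDS
    "$@" > /dev/null 2>&1
    echo $(( SECONDS - start ))
}

# Only attempt the real run when vsearch and the input file are available.
if command -v vsearch > /dev/null 2>&1 && [ -f AF091148.fsa ]; then
    t=$(elapsed_seconds vsearch --allpairs_global AF091148.fsa --acceptall \
        --top_hits_only --uc af.out.txt --threads 1)
    echo "allpairs_global on AF091148.fsa: ${t}s"
else
    echo "vsearch or AF091148.fsa not found; skipping"
fi
```

For finer resolution, the shell's `time` keyword or a dedicated profiler would be a better fit; the wrapper is just enough to compare command lines against each other.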

colinbrislawn commented 2 years ago

I think you are off to a good start!

Just to check, are any of your input fasta files or databases full of amino acid sequences (like maybe swissprot), or are these all nucleic acid (DNA/RNA) files? Vsearch only supports nucleic acid right now (https://github.com/torognes/vsearch/issues/42), so I wanted to double check.

While I'm checking on stuff, these future CPUs include vectorization hardware, right? 😉 I have no idea how (or if) the vectorised full dynamic programming algorithm would run on hardware without it.

--orient is an interesting addition, as it's the only command here that does k-mer counting only, without the vectorized alignment. How this compares to --usearch_global, which includes alignment, should be interesting!

torognes commented 2 years ago

As @colinbrislawn wrote, vsearch does not support amino acid (protein) sequences, only nucleotide sequences (DNA and RNA). This means that swissprot.fasta and ipi.BOVIN.fasta are not usable. (SWISS-PROT=protein database established in Switzerland; IPI=International Protein Index.)
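A quick heuristic can flag protein files before they end up in a benchmark. The sketch below looks for the letters E, F, I, L, P and Q, which occur among the standard amino acids but are not IUPAC nucleotide codes. The function name `guess_alphabet` is invented for this example, and the check is a rough filter, not a validator: a very short protein file could in principle avoid all six letters.

```shell
#!/usr/bin/env bash
# Heuristic: report "protein" if any sequence line (non-header line) contains
# E, F, I, L, P or Q, which are amino acid letters but not IUPAC nucleotide
# codes; otherwise report "nucleotide".
guess_alphabet() {
    if grep -v '^>' "$1" | grep -qi '[EFILPQ]'; then
        echo protein
    else
        echo nucleotide
    fi
}

# Demo on two tiny stand-in files.
tmpdir=$(mktemp -d)
printf '>rna1\nACGUNNRYACGT\n' > "$tmpdir/nuc.fasta"
printf '>prot1\nMKLVEFPQAG\n' > "$tmpdir/prot.fasta"
echo "nuc.fasta: $(guess_alphabet "$tmpdir/nuc.fasta")"    # prints "nuc.fasta: nucleotide"
echo "prot.fasta: $(guess_alphabet "$tmpdir/prot.fasta")"  # prints "prot.fasta: protein"
rm -rf "$tmpdir"
```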

The fastx_uniques, cluster_size, uchime_denovo, usearch_global and allpairs_global commands are some of the most important and demanding commands in vsearch.

Large FASTQ files with sequencing reads can be downloaded from the SRA, the Sequence Read Archive, from NCBI (https://www.ncbi.nlm.nih.gov/sra) or EBI (https://www.ebi.ac.uk/ena/browser/home). These are used as query files.

Reference databases containing ribosomal RNA from many species, e.g. 16S rRNA, as in the SILVA database (https://www.arb-silva.de), are often used. Another option is the UNITE database (https://unite.ut.ee), with ITS sequences from fungi.

heshpdx commented 2 years ago

Here is some detailed info. I am representing the SPEC organization and we are searching for CPU bound workloads to make into the next generation CPU benchmarks. https://spec.org/cpuv8/

If you submit vsearch as a candidate, there are monetary rewards for your team and I would work with you directly to craft realistic command lines and make sure the code runs on many platforms. It's a great way to have an impact on the future hardware designs in the tech industry.

Here is the very simple submission form to start the process. https://www.spec.org/cpu/cpuv8/entry_form.html


torognes commented 2 years ago

Thank you, @heshpdx, for the invitation and opportunity to contribute to the SPEC benchmarks. We will consider participating.

I have a question, though. Since some of the essential code for aligning sequences (found in align_simd.cc) is fairly low level (using SIMD intrinsics), won't the performance on different platforms depend to a large degree on our ability to write efficient code for each platform (x86, arm, ppc)? Of course the compiler and hardware are also very important. But wouldn't this be a disadvantage for code meant to benchmark CPUs?

heshpdx commented 2 years ago

[I had emailed my most recent comment about cpuv8 info back on Sep 20, so I'm not sure why it took so long to show up here.]

@torognes Over the previous few weeks, I contributed vsearch into the cpuv8 repository along with some simple cmdlines. I was hoping that your multi-dispatch shim layer in align_simd.cc (which is amazing BTW) would be able to bend the strict rules. Even though the current code can run on power, aarch64, and x86, at this point it doesn't run on riscv; and more importantly, it won't run on whatever future ISA may be popular in 10 years. We thought about asking for a generic implementation of those functions, but at the end of the day that conflicts with the spirit of vsearch being a fast algorithm, and it wouldn't represent how users actually run vsearch. So we decided to stop investigating vsearch for that reason. @colinbrislawn pointed me to DIAMOND, which I am investigating now. If you are still interested in helping craft genomics benchmarks, let me know. I need help verifying behaviors using different gcc optimization levels on multiple architectures. (Why would a protein search return a different result with -O3 versus -O1?)

Thanks!

torognes commented 2 years ago

Ok, thanks for the explanation.

frederic-mahe commented 3 weeks ago

I am closing that issue for now. Feel free to re-open if need be.
