ratschlab / metagraph

Scalable annotated de Bruijn graphs for DNA indexing, alignment, and assembly
http://metagraph.ethz.ch
GNU General Public License v3.0
111 stars 17 forks source link

Kmer width and Python API #399

Closed hrwaik closed 2 years ago

hrwaik commented 2 years ago

Hello,

I've been trialing Metagraph for some work on my current project. I just had a few quick questions regarding certain functions.

First, from the wiki regarding --count-width, if the value is set low does this cap the kmers to the lower bound, meaning any high abundance kmers would not be found? Does this essentially make low abundance taxa more easily searchable than if this is not set?

Second, regarding the Python API, how do I go about setting the new locations in the GraphClient command to search the created indices? I'm new to APIs so any help would be greatly appreciated.

Finally, most of the data I'm indexing are raw sequencing reads and so I build canonical graph, clean it and then build a final primary graph from the cleaned fasta. For any complete genomes/protein multifastas, I assume the cleaning step is not needed. Would you suggest building the graph from that without any cleaning?

Many thanks for your help, Huw

karasikov commented 2 years ago

Hi Huw,

First, from the wiki regarding --count-width, if the value is set low does this cap the kmers to the lower bound, meaning any high abundance kmers would not be found? Does this essentially make low abundance taxa more easily searchable than if this is not set?

--count-width sets the maximum count. So all k-mers with higher abundance will still be found, but their abundance will be reset to the maximum possible value corresponding to that width.

Second, regarding the Python API, how do I go about setting the new locations in the GraphClient command to search the created indices? I'm new to APIs so any help would be greatly appreciated.

I'm not sure I got what you mean by setting the new locations. There is an example (https://metagraph.ethz.ch/static/docs/api.html#example-of-search-in-metasub) in the API on how to connect to the metagraph server. If you run it locally, just pass localhost and <PORT> instead of the standard dnaloc.ethz.ch and port 80

For any complete genomes/protein multifastas, I assume the cleaning step is not needed. Would you suggest building the graph from that without any cleaning?

Correct. Graphs constructed from assembled sequences (genomes or contigs) don't require an additional cleaning as those constructed from raw sequencing reads. Also, when you build an index purely from assembled genomes (e.g., RefSeq), often you do know the orientation of the sequences, and hence, you can build that graph in the basic mode (non-canonical, non-primary).

Let me know if you have more questions!

hrwaik commented 2 years ago

Hi Mikhail,

Many thanks for your reply and apologies for my slow response.

Regarding the Python API, by new locations, I meant the locations of any indices I create. For instance, if the created index is located in an Amazon S3 bucket, I assume I replace dnaloc.ethz.ch with the S3 bucket location and the api_path is the path to the specific index, is that correct? I'm still unclear on determining the port number, is there any information regarding APIs you can suggest for me to look at for this?

Regarding the annotated genomes (ie protein multifastas), have you done any testing on changing kmer size influencing this? For building, I don't see an option for choosing the graph mode in the metagraph_Protein build options. Is the default option for this sufficient?

Many thanks for your help, Huw

karasikov commented 2 years ago

If that Amazon S3 bucket has a public IP, you should just provide that instead of dnaloc.ethz.ch. For the port, pass to metagraph server_query flag --port with the desired value (otherwise, the default 5555 will be used). In this setting, you don't need to pass any api_path, so that flag should be empty in the python API. (api_path is used to pass requests through another HTTP server, usually on port 80, redirecting them to other engines.)

Regarding the annotated genomes (ie protein multifastas), have you done any testing on changing kmer size influencing this? For building, I don't see an option for choosing the graph mode in the metagraph_Protein build options. Is the default option for this sufficient?

You can simply set the width to 32 bits, which should always be enough to represent the actual k-mer counts as they are without any rounding or clipping. As for the graph mode with Protein alphabets, it indeed only supports the basic mode, as canonical (and hence primary) is not applicable to protein sequences but only relevant to DNA.

karasikov commented 2 years ago

FYI: I added a short paragraph to the docs describing the setting with a locally hosted index. https://metagraph.ethz.ch/static/docs/api.html#query-a-locally-hosted-index

hrwaik commented 2 years ago

Sorry, for the delayed response. Thank you for the information, it was really useful.

I was just wondering if you could give some explanation to to the output format from the metagraph align command. For instance, how are the scores calculated and does this differ for DNA and protein sequences? I had a look in the online documentation but that part is still under development.

Many thanks, Huw

karasikov commented 2 years ago

@hmusta could you please comment on that?

hmusta commented 2 years ago

Sorry, for the delayed response. Thank you for the information, it was really useful.

I was just wondering if you could give some explanation to to the output format from the metagraph align command. For instance, how are the scores calculated and does this differ for DNA and protein sequences? I had a look in the online documentation but that part is still under development.

Many thanks, Huw

The output is in TSV format, with one line per query sequence. If there are N hits from a given query, then each line will have 2 + N 6 columns to report if no annotator is passed and 2 + N 7 if an annotator is passed. Assuming that an annotator is passed to the aligner, the columns are as follows, where 0 <= i < N.

Col Description
0 Query header
1 Query sequence
i*7 Orientation of the query sequence that was aligned to the graph
i*7+1 Alignment matching sequence
i*7+2 Alignment score (Smith-Waterman local alignment score with end bonus)
i*7+3 Number of exactly matching characters in the alignment
i*7+4 Alignment CIGAR (see page 8 of https://samtools.github.io/hts-specs/SAMv1.pdf)
i*7+5 Number of characters discarded from the prefix of the graph path to get the matching sequence
i*7+6 Annotations for the path

For DNA graphs, the scoring model is defined by the input arguments (match, mismatch, gap open, and gap extend penalties). For protein graphs, it defaults to using a BLOSUM62 scoring matrix for match/mismatch and user-set gap open and extend penalties.

I hope this description helps. Let me know if there's anything confusing.

hrwaik commented 2 years ago

That's very helpful, thanks. Just out of curiosity, do you know how ambiguous bases or amino acids will be handled by these scoring methods? I presume that for the DNA scoring it won't really matter as long as the bases match it should be fine, though for ambiguous amino acids being scored by BLOSUM62 seems less intuitive. Do you know how these are handled?

Many thanks

hmusta commented 2 years ago

Any character that's not one of ACGT (for DNA) or ARNDCQEGHILKMFPSTWYVBZX (for amino acids) is treated as a mismatch with the lowest score/greatest penalty.

In the BLOSUM62 table, there are indeed variable scores for matching different amino acids with an unknown amino acid. Even though I'm not 100% sure of this, I'd assume that they are derived from ground-truth alignments of amino acid sequences that feature unresolved/unknown amino acids.