lazear / sage

Proteomics search & quantification so fast that it feels like magic
https://sage-docs.vercel.app
MIT License

Memory consumption for extremely large search spaces #97

Open grosenberger opened 8 months ago

grosenberger commented 8 months ago

Hi Michael,

When using Sage with very large search spaces (e.g. many PTMs, non-specific digestion, etc.), the memory consumption frequently exceeds the available resources on standard workstations. In such scenarios, MSFragger partitions the search space and processes the partitions iteratively.

I was wondering whether similar functionality could be implemented in Sage. For example, a "batch size" parameter could be set manually (or estimated from the available memory) to partition the search space. There are several ways this could be implemented; one option would be to partition the candidate peptide precursors by precursor m/z. For DIA, the partitions could correspond to the precursor isolation windows; for DDA, it might make sense to simply choose m/z ranges according to the batch size. The main search algorithm would then iterate over the partitions for scoring, and the results from the individual partitions would be assembled before ML and statistical validation. The partitions of the search space could be generated on the fly and kept in memory, or alternatively exported to disk (similar to how MSFragger does it).
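To make the proposal concrete, here is a minimal sketch of how precursor m/z partitions could be derived from a batch count (the m/z range, batch count, and function name are illustrative only, not Sage functionality):

def mz_partitions(min_mz, max_mz, batches):
    """Return (lower, upper) precursor m/z bounds for each partition."""
    width = (max_mz - min_mz) / batches
    return [(min_mz + i * width, min_mz + (i + 1) * width) for i in range(batches)]

# e.g. five batches over a typical precursor range; each batch would be
# scored independently and the results assembled before rescoring
for lower, upper in mz_partitions(400.0, 1600.0, 5):
    print(f"score candidates with precursor m/z in [{lower:.1f}, {upper:.1f})")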

What do you think about these options? Would there be a preferred solution?

Best regards, George

lazear commented 8 months ago

Hi George,

I agree that it's a necessity for large search spaces. I have been messing around with some internal database splitting, but it's not ready for prime-time yet.

In the meantime, it's possible to perform external database splitting - generate slices of the FASTA file and run Sage multiple times, then combine the results and rescore. Perhaps not ideal, but this is essentially what would be done with internal database splitting as well. See below for an example Python script for accomplishing this.

import subprocess

import pandas as pd
from Bio import SeqIO

SLICES = 5

# Load every protein record once so the slices can be written out below.
records = list(SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"))

# Ceiling division so the final slice also picks up any remainder records.
N = -(-len(records) // SLICES)
for i in range(SLICES):
    # Write slice i of the database to its own FASTA file.
    with open(f"fasta/human_slice_{i}.fasta", "w") as f:
        for record in records[i * N : (i + 1) * N]:
            SeqIO.write(record, f, format="fasta")
    # Search the same mzML against this slice only, writing a Percolator pin file.
    cmd = [
        "sage",
        "search.json",
        "-o",
        f"semi_{i}",
        "-f",
        f"fasta/human_slice_{i}.fasta",
        "--write-pin",
        "HeLa_chytry_HCD_1.mzML.gz",
    ]
    subprocess.run(cmd, check=True)

# Merge the per-slice pin files, keeping only the best-scoring PSM per spectrum.
dfs = []
for i in range(SLICES):
    dfs.append(pd.read_csv(f"semi_{i}/results.sage.pin", sep="\t"))

pd.concat(dfs).sort_values(by="ln(hyperscore)", ascending=False).drop_duplicates(
    subset=["FileName", "ScanNr"], keep="first"
).to_csv("sliced.pin", sep="\t", index=False)

patrick-willems commented 8 months ago

Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this is also not compatible with chimeric searching)? Would a valid alternative be to split the searches by precursor m/z into consecutive searches (each against the whole FASTA instead of splitting the FASTA)? I once tried it (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

grosenberger commented 8 months ago

Thanks for the feedback! We have used similar workarounds before. FragPipe also uses comparable mechanisms for very large databases.

lazear commented 8 months ago

> Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and retaining only the best match you might lose hits (this is also not compatible with chimeric searching)? Would a valid alternative be to split the searches by precursor m/z into consecutive searches (each against the whole FASTA instead of splitting the FASTA)? I once tried it (by generating alternative JSONs in a loop), but the memory consumption did not decrease.

Interesting that this didn't decrease memory consumption - setting peptide_min_mass and peptide_max_mass will restrict the number of final peptides kept and fragments generated (the filter is applied after digestion).
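For reference, this is roughly what the "alternative JSONs in a loop" approach could look like; a sketch assuming peptide_min_mass / peptide_max_mass sit under the "database" section of search.json, with arbitrary mass windows and output names:

import json
import subprocess

# Consecutive peptide mass windows (Da); the boundaries here are arbitrary.
windows = [(500.0, 1500.0), (1500.0, 2500.0), (2500.0, 3500.0), (3500.0, 5000.0)]

with open("search.json") as f:
    base = json.load(f)

for i, (lo, hi) in enumerate(windows):
    cfg = json.loads(json.dumps(base))  # cheap deep copy of the base config
    cfg["database"]["peptide_min_mass"] = lo
    cfg["database"]["peptide_max_mass"] = hi
    with open(f"search_mass_{i}.json", "w") as f:
        json.dump(cfg, f, indent=2)
    # Whole FASTA (as configured in search.json), restricted to one mass window.
    subprocess.run(
        [
            "sage",
            f"search_mass_{i}.json",
            "-o",
            f"mass_{i}",
            "--write-pin",
            "HeLa_chytry_HCD_1.mzML.gz",
        ],
        check=True,
    )

The per-window pin files could then be merged in the same way as the sliced searches above.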

That is a valid point about chimeric searches, but those are already somewhat heuristic (a subtractive method vs. something potentially smarter). One alternative would be to pre-digest the FASTA database (and pass "$" as the cleavage enzyme to Sage), and then chunk the pre-digested FASTA by peptide mass. That should help with chimeric searches and possibly make the search faster as well - this is basically what would be implemented internally.
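A minimal sketch of that pre-digestion idea, assuming fully tryptic peptides with no missed cleavages or modifications (the chunk count, file names, and digestion rule are simplified for illustration):

import re

from Bio import SeqIO

# Monoisotopic residue masses (Da); water is added once per peptide.
MASSES = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565

def tryptic(seq):
    """Fully tryptic peptides: cleave after K/R, but not before P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", seq) if p]

def mass(pep):
    return sum(MASSES.get(aa, 0.0) for aa in pep) + WATER

# Digest once, then sort peptides by mass and write mass-ordered FASTA chunks.
peptides = set()
for rec in SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"):
    peptides.update(tryptic(str(rec.seq)))

peptides = sorted(peptides, key=mass)
CHUNKS = 5
size = -(-len(peptides) // CHUNKS)  # ceiling division so nothing is dropped
for i in range(CHUNKS):
    with open(f"fasta/predigested_{i}.fasta", "w") as f:
        for j, pep in enumerate(peptides[i * size : (i + 1) * size]):
            f.write(f">chunk{i}_pep{j}\n{pep}\n")
# Each chunk would then be searched with "$" as the cleavage enzyme so Sage
# treats every FASTA entry as an already-digested peptide, and the per-chunk
# results merged as in the slicing script above.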