scverse / scvi-tools

Deep probabilistic analysis of single-cell and spatial omics data
http://scvi-tools.org/
BSD 3-Clause "New" or "Revised" License
1.24k stars · 350 forks

Add the motif_score method of scBasset in scvi-tools #1952

Closed Starlitnightly closed 1 year ago

Starlitnightly commented 1 year ago

Thank you for your effort on implementing scBasset.

Can a similar approach to scbasset.utils.motif_score also be integrated into scvi-tools in the future?

The implementation in scBasset is quite simple: the output of the model's [-8] layer is taken, and then some simple arithmetic is performed on it.

I wanted to extract the corresponding layer from scvi's (PyTorch) model for prediction, as one would in Keras, but I failed; I can't find where the model parameters are stored. So would it be possible to add a motif_score function in the next version?

method link: https://github.com/calico/scBasset/blob/c15bec3a73fa1e04822db723338d234ca9d384ce/scbasset/utils.py#L453
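For context on the question above: PyTorch has no direct analogue of slicing `model.layers` as in Keras, but intermediate activations can usually be captured with a forward hook. A generic sketch with a hypothetical toy model (not the actual scvi-tools module):

```python
import torch
import torch.nn as nn

# Hypothetical toy network standing in for a trained model.
model = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),
    nn.Linear(16, 4),
)

captured = {}

def hook(module, inputs, output):
    # Store the activation produced by the hooked layer.
    captured["activation"] = output.detach()

# Hook an intermediate layer (analogous to taking layers[-k] in Keras).
handle = model[1].register_forward_hook(hook)

x = torch.randn(2, 8)
_ = model(x)
handle.remove()

print(captured["activation"].shape)  # activations of the ReLU layer
```

This only illustrates the general hook mechanism; finding the right submodule of an scvi-tools model still requires inspecting its architecture (e.g. via `print(model.module)`).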

import numpy as np
import tensorflow as tf
from Bio import SeqIO

# Note: the `tf` parameter below (a transcription factor name) shadows the
# tensorflow import inside motif_score; tensorflow is only used in
# imputation_Y_normalize. dna_1hot is defined elsewhere in scbasset/utils.py.
def motif_score(tf, model, motif_fasta_folder, bc=False):
    """score motifs for any given TF.
    Args:
        tf:             TF of interest. By default we only provide TFs to score in
                        https://storage.googleapis.com/scbasset_tutorial_data/Homo_sapiens_motif_fasta.tar.gz.
                        To score on additional motifs, follow make_fasta.R in the tarball 
                        to create dinucleotide shuffled sequences with and without motifs of
                        interest.
        model:          a trained scBasset model.
        motif_fasta_folder: folder for dinucleotide shuffled sequences with and without any motif.
                        We provided motifs from CIS-BP/Homo_sapiens.meme downloaded from the
                        MEME Suite (https://meme-suite.org/meme/) in 
                        https://storage.googleapis.com/scbasset_tutorial_data/Homo_sapiens_motif_fasta.tar.gz.
    Returns:
        array:          a vector for motif activity per cell. (cell order is the
                        same order as the model.)
    """
    fasta_motif = "%s/shuffled_peaks_motifs/%s.fasta" % (motif_fasta_folder, tf)
    fasta_bg = "%s/shuffled_peaks.fasta" % motif_fasta_folder

    pred_motif = pred_on_fasta(fasta_motif, model, bc=bc)
    pred_bg = pred_on_fasta(fasta_bg, model, bc=bc)
    tf_score = pred_motif.mean(axis=0) - pred_bg.mean(axis=0)
    tf_score = (tf_score - tf_score.mean()) / tf_score.std()
    return tf_score
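The scoring arithmetic in `motif_score` is just a difference of per-cell means followed by a z-score across cells. A standalone NumPy sketch with synthetic predictions (the shapes here are illustrative, not scBasset's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic accessibility predictions:
# rows = shuffled sequences, columns = cells.
pred_motif = rng.normal(loc=0.5, scale=1.0, size=(100, 10))  # motif inserted
pred_bg = rng.normal(loc=0.0, scale=1.0, size=(100, 10))     # background shuffles

# Per-cell motif activity: mean over sequences, motif minus background.
tf_score = pred_motif.mean(axis=0) - pred_bg.mean(axis=0)

# Z-score across cells so activities are comparable between TFs.
tf_score = (tf_score - tf_score.mean()) / tf_score.std()

print(tf_score.shape)  # one score per cell
```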

def pred_on_fasta(fa, model, bc=False, scale_method=None):
    """Run a trained model on a fasta file.
    Args:
        fa:             fasta file to run on. Sequences must have a fixed length
                        of 1344, the default input size of a trained model.
        model:          a trained scBasset model.
    Returns:
        array:          a peak*cell imputed accessibility matrix, corrected for
                        sequencing depth.
    """
    records = list(SeqIO.parse(fa, "fasta"))
    seqs = [str(i.seq) for i in records]
    seqs_1hot = np.array([dna_1hot(i) for i in seqs])
    pred = imputation_Y_normalize(seqs_1hot, model, bc_model=bc, scale_method=scale_method)
    return pred
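`dna_1hot` is defined elsewhere in scbasset/utils.py. As a rough illustration of what it produces, here is a simplified stand-in that one-hot encodes a plain ACGT string (the real function additionally handles ambiguous bases, which this sketch does not):

```python
import numpy as np

def dna_1hot_simple(seq):
    """One-hot encode a DNA string into a (len, 4) array, columns in A, C, G, T order.

    Simplified stand-in for scBasset's dna_1hot; does not handle N bases.
    """
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        out[i, lookup[base]] = 1.0
    return out

onehot = dna_1hot_simple("ACGT")
print(onehot)
```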

# perform imputation. Depth normalized.
def imputation_Y_normalize(X, model, bc_model=False, scale_method=None):
    """Perform imputation. Normalize for depth.
    Args:
        X:              feature matrix from h5.
        model:          a trained scBasset model.
    Returns:
        array:          a peak*cell imputed accessibility matrix. Sequencing depth corrected
                        for. scale_method=None, don't do any scaling of output. The raw
                        normalized output would have both positive and negative values.
                        scale_method="all_positive" scales the output by subtracting minimum value.
                        scale_method="sigmoid" scales the output by sigmoid transform.
    """
    if bc_model:
        new_model = tf.keras.Model(
            inputs=model.layers[0].input,
            outputs=model.layers[-8].output,
        )
        Y_pred = new_model.predict(X)
        w = model.layers[-6].get_weights()[0]
        accessibility_norm = np.dot(Y_pred.squeeze(), w)

    else:
        new_model = tf.keras.Model(
            inputs=model.layers[0].input,
            outputs=model.layers[-4].output,
        )
        Y_pred = new_model.predict(X)
        w = model.layers[-3].get_weights()[0]
        accessibility_norm = np.dot(Y_pred.squeeze(), w)

    # scaling
    if scale_method == "positive":
        accessibility_norm = accessibility_norm - np.min(accessibility_norm)
    if scale_method == "sigmoid":
        accessibility_norm = np.divide(
            1,
            1 + np.exp(-accessibility_norm),
        )

    return accessibility_norm
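The two scaling options at the end of `imputation_Y_normalize` can be illustrated in isolation with NumPy:

```python
import numpy as np

# A small matrix standing in for the raw normalized accessibility output,
# which contains both positive and negative values.
accessibility_norm = np.array([[-2.0, 0.0], [1.0, 3.0]])

# "all_positive": shift so the minimum value becomes zero.
shifted = accessibility_norm - np.min(accessibility_norm)

# "sigmoid": squash every value into the open interval (0, 1).
squashed = 1.0 / (1.0 + np.exp(-accessibility_norm))

print(shifted.min())                    # 0.0
print(squashed.min(), squashed.max())   # both strictly between 0 and 1
```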
jacobkimmel commented 1 year ago

FWIW, I have an implementation of this working now. Will PR in a few days once I've benchmarked.

Starlitnightly commented 1 year ago

> FWIW, I have an implementation of this working now. Will PR in a few days once I've benchmarked.

Thanks for your work!