sagar87 / tensorsignatures

Tensor framework for mutational signature analysis.
MIT License

Question about TensorSignatures #13

Open mishugeb opened 3 years ago

mishugeb commented 3 years ago

Extracted_Signatures.txt Input_vcf_files.zip ground.truth.syn.sigs.csv.zip

I hope that you are doing well!

I have a question about your tool for the analysis of mutational signatures: TensorSignatures. As part of a large-scale benchmarking of tools for the analysis of mutational signatures, we have been trying to run TensorSignatures with synthetic datasets (i.e., ones with known signatures and activities of these signatures). Unfortunately, we keep getting really poor results even on the simplest datasets with TensorSignatures (example attached: 5 signatures in 200 genomes).

I wanted to double-check that we are not making any mistake in running your tool. I have attached files that include a folder containing the ground-truth signatures, the exact input data used to run TensorSignatures, and the output of the tool (Extracted_Signatures.txt). The signatures extracted by TensorSignatures match only 3 of the ground-truth signatures with a cosine similarity above 0.9. Any advice would be greatly appreciated, as we do not want to misreport the tool's performance.

sagar87 commented 3 years ago

Hi mishugeb,

thanks for your interest in TensorSignatures. Can you describe how you ran the tool on your data? The file Input_vcf_files.zip does not contain any VCFs, and the table in Extracted_Signatures.txt does not look like (or at most looks like partial) output produced by TensorSignatures (e.g. the tool extracts SBS spectra for different genomic contexts and genomic parameters for each signature).

Thanks!

mishugeb commented 3 years ago

vcf_input.zip ground.truth.syn.sigs.csv.zip https://github.com/sagar87/tensorsignatures/files/5852849/Extracted_Signatures.txt

Hello, thanks for the quick response. My bad, I did not upload the correct input folder with the VCF files in the previous message. Also, Extracted_Signatures.txt is the matrix we extracted using TensorSignatures.

Here are the steps we took, from start to end, after simulating the VCF files. The steps up to the tensorsignatures write command are for running TensorSignatures; the steps after that are for processing the output.

We first prepared the input data using docker:

docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data *.vcf output.h5

Next, we computed the trinucleotide normalization:

tensorsignatures prep output.h5 tsdata.h5

After that, we ran a denovo extraction using TensorSignatures:

for rank in {1..10}; do
  for init in {0..9}; do
    tensorsignatures train tsdata.h5 ssol${rank}_${init}.pkl ${rank} -i ${init} -j MyFirstExperiment_s
  done
done

Then we took the output pickle files and used tensorsignatures write to summarize the results from multiple initializations:

tensorsignatures write "ssol*.pkl" results.h5

After that, we used the Experiment class from tensorsignatures to load the clustered initializations for each decomposition rank from the previous step. Using that, we generated a seaborn box plot to determine the optimal rank and then used the corresponding solution:

experiment = ts.Experiment("results.h5")
cluster = experiment['/MyFirstExperiment_s/6']
signatures = cluster.S
sig = signatures.mean(axis=(0, 1, 2, 3, 4, 7))
sig = pd.DataFrame(sig)
sig.to_csv("signatures.txt", sep="\t")
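For completeness, the box-plot step looked roughly like this (an illustrative sketch; we assume the Experiment object exposes a summary_table with 'rank' and 'BIC' columns):

import seaborn as sns
import matplotlib.pyplot as plt

# Illustrative sketch of the rank-selection step: one box per decomposition
# rank, summarising the model-selection score across initializations
# (the summary_table attribute and its column names are assumptions here).
sns.boxplot(x='rank', y='BIC', data=experiment.summary_table)
plt.xlabel('Decomposition rank')
plt.ylabel('BIC')
plt.show()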

Please let me know if there is anything that I can clarify.

sagar87 commented 3 years ago

Hi mishugeb,

thanks for providing the VCF files. I ran tensorsignatures on your dataset to double-check. The way you would assess the algorithm is as follows:

As you described, start by loading the summarised hdf5 file using ts.Experiment

E = ts.Experiment('/homes/harald/research/experiments/2021.01.24_sim/results.h5')

and perform model selection using the BIC (the minimum is at rank 5).

sns.boxplot(x='rank', y='BIC', data=E.summary_table)

[image: box plot of BIC against decomposition rank]

Then select the rank-5 cluster, which contains multiple solutions (how many depends on how often you ran the algorithm, i.e. for init in {1..10} ...). From this cluster you want to select the best solution (i.e. the one that achieved the highest log-likelihood during optimisation); you can get it by indexing the cluster with the clu.init field.

clu = E['/MISGEHUB50/5']
init = clu[clu.init]

To inspect the extracted signatures you may want to plot them:

plt.figure(figsize=(16, 6))
init.plot_signatures()

[image: extracted signature spectra in transcription and replication contexts]

which shows the extracted signatures in the transcription (left) and replication (right) contexts (dark and light colours correspond to mutation-type probabilities on the coding/template and leading/lagging strands).

Now, to assess the signature fit, you want to compare the spectra to the original signatures. The tensorsignatures package comes with the ts.assign_signatures function, which computes pairwise cosine distances between two signature matrices and matches the most similar signatures. In my test no signature had a cosine distance greater than 0.00832.

sigs = pd.read_csv('/homes/harald/research/experiments/2021.01.24_sim/ground.truth.syn.sigs.csv')
reference = sigs.iloc[:,2:].values
r, c, dist = ts.assign_signatures(reference, init.S[2,2,0,0,0,:,:,0])
ax = sns.heatmap(np.eye(5) * dist, cmap='Reds')
ax.set_ylabel('Reference signature')
ax.set_xlabel('Matched tensor signature')
ax.set_title('Cosine distance')
ax.set_yticklabels(r)
ax.set_xticklabels(c)

[image: heatmap of cosine distances between reference and matched tensor signatures]

Notice the indexing of the signature tensor (init.S[2,2,0,0,0,:,:,0]); this is important because the algorithm infers the spectra across different genomic contexts (the axes of the tensor encode transcription x replication x epigenetic states x nucleosomal states x clustering x trinucleotides x signature x initialisation). Indexing init.S with [2,2,0,0,0,:,:,0] selects the spectrum across baseline states. This gives you a (96 x 5) matrix, which is what you would also get from conventional NMF algorithms.
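If you want to double-check the matching by hand, here is a minimal sketch (assuming the r and c returned by ts.assign_signatures are the indices of the matched reference/extracted signature pairs; the cosine helper is only for illustration):

import numpy as np

def cosine_similarity(a, b):
    # plain cosine similarity between two 96-dimensional spectra
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

extracted = init.S[2, 2, 0, 0, 0, :, :, 0]  # (96 trinucleotides x 5 signatures)
for ref_idx, ext_idx in zip(r, c):
    sim = cosine_similarity(reference[:, ref_idx], extracted[:, ext_idx])
    print('reference {} <-> tensor signature {}: cosine similarity {:.4f}'.format(ref_idx, ext_idx, sim))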

I've uploaded my results so you can check them for yourself.

results.h5.zip

Please let me know if anything is left unclear, and thank you for trying out the algorithm :)

mishugeb commented 3 years ago

Thanks a lot for your help! We are trying to find the differences between your approach and ours.

sagar87 commented 3 years ago

Cool, let me know if you have questions.

mdbarnesUCSD commented 3 years ago

Hello,

I am trying to reproduce the results that you generated using the input VCFs provided by @mishugeb. Could you please clarify what parameters you used to generate results.h5.zip? I believe the initializations ranged over {0..51} and that the number of epochs was set to 30000. Please let me know if any other parameters were changed.

Additionally, I am trying to run the code on a GPU instance. I believe I am still having an issue because the output of $ nvidia-smi shows a process on the GPU using ~70MiB of GPU memory while GPU-Util remains at 0%. I will post the nvidia-smi output on #9.
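As a quick sanity check, I am also listing the devices TensorFlow can see (just an illustrative snippet, assuming the tensorflow-gpu build that tensorsignatures runs on):

# quick check that tensorflow-gpu detects the GPU; a /device:GPU:0 entry
# should appear in the list if CUDA/cuDNN are set up correctly
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())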

Thank you!

sagar87 commented 3 years ago

Hi mdbarnesUCSD,

you should be able to reproduce the results above by executing

$ docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data *.vcf output.h5

in the folder into which you downloaded mishugeb's VCF files. Then run

tensorsignatures prep output.h5 final.h5

to compute the trinucleotide normalisation across genomic states. I then have a generic bash script which runs tensorsignatures for several initialisations and decomposition ranks (see below). You would have to adapt this for your cluster environment or AWS.

#!/bin/bash

# you can ignore the following lines as they are cluster specific
# perhaps something similar is needed to get tensorflow-gpu running on AWS
. /etc/profile.d/modules.sh
module purge
module load rhel7/default-gpu              
module load python-3.6.1-gcc-5.4.0-xk7ym4l
module unload cuda/8.0
module load cuda/9.0
module load cudnn/7.0_cuda-9.0

# source the virtualenv where you have installed tensorflow-gpu, tensorsignatures etc...
source /home/hsv23/rds/hpc-work/2019.11.29_mathi/env/bin/activate

# path to the input file (output of tensorsignatures prep)
INPUT=/home/hsv23/rds/hpc-work/2021.01.24_misgehub/final.h5

MINRANK=2
MAXRANK=10
MINITER=0
MAXITER=50

# for the example I used an overdispersion of 50 (this usually works well)
k=50
EPOCHS=30000

for ((r=$MINRANK; r<=$MAXRANK; r++)) do
  for ((i=${MINITER}; i<=${MAXITER}; i++)) do
    tensorsignatures --verbose train ${INPUT} MISGEHUB_${k}_${r}_${i}.pkl ${r} -k ${k} -i ${i} -j MISGEHUB${k} -n -ep ${EPOCHS};
  done;
done;

Finally, combine the results:

tensorsignatures write "MISGEHUB_*.pkl" results.h5

I hope this helps.

mdbarnesUCSD commented 3 years ago

Thanks, this was very helpful!

Does tensorsignatures train have the '-n' flag set by default or does this flag need to be provided?

sagar87 commented 3 years ago

Yes, the flag needs to be provided: please set -n, as it is necessary to factor in the trinucleotide normalisation during the factorisation.

mishugeb commented 3 years ago

I have one more question. Can we align/decompose the signatures extracted by TensorSignatures to the COSMIC signatures?
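What we have in mind is something along these lines (just an illustrative sketch; the COSMIC matrix file name, its column layout, and the trinucleotide channel ordering are assumptions that would need to be checked):

import pandas as pd
import tensorsignatures as ts

# Illustrative sketch: match the baseline spectra extracted by TensorSignatures
# against a COSMIC SBS matrix (96 channels x signatures). The file name and the
# channel ordering are assumptions and would need to be verified.
cosmic = pd.read_csv('COSMIC_v3.2_SBS_GRCh37.txt', sep='\t', index_col=0)
extracted = pd.read_csv('signatures.txt', sep='\t', index_col=0).values
rows, cols, dist = ts.assign_signatures(cosmic.values, extracted)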

Thanks