mishugeb opened 3 years ago
Hi mishugeb,
thanks for your interest in TensorSignatures. Can you describe how you ran the tool on your data? The file Input_vcf_files.zip does not contain any VCFs, and the table in Extracted_Signatures.txt does not look like the output produced by TensorSignatures, or at most like a partial one (e.g. the tool extracts SBS spectra for different genomic contexts and genomic parameters for each signature).
Thanks!
vcf_input.zip ground.truth.syn.sigs.csv.zip https://github.com/sagar87/tensorsignatures/files/5852849/Extracted_Signatures.txt
Hello, thanks for the quick response. My bad, I did not upload the correct input folder with the vcf files in the previous message. Also, Extracted_Signatures.txt is the matrix we extracted using TensorSignatures.
We did the extraction using the following steps, from start to end after having simulated the vcf files. The steps for running TensorSignatures are up until the tensorsignatures write command; after that are the steps for processing the output.
We first prepared the input data using docker:
docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data *.vcf output.h5
Next, we computed the trinucleotide normalization:
tensorsignatures prep output.h5 tsdata.h5
After that, we ran a denovo extraction using TensorSignatures:
for rank in {1..10}; do
  for init in {0..9}; do
    tensorsignatures train tsdata.h5 ssol${rank}_${init}.pkl ${rank} -i ${init} -j MyFirstExperiment_s
  done
done
Then we took the output pickle files and used tensorsignatures write to summarize the results from multiple initializations:
tensorsignatures write "ssol*.pkl" results.h5
After that, we used the Experiment class from tensorsignatures to load the cluster initializations of each decomposition rank from the previous step. Using that, we generated a seaborn box plot to determine the optimal rank, and then used that solution:
experiment = ts.Experiment("results.h5")
cluster = experiment['/MyFirstExperiment_s/6']
signatures = cluster.S
sig = signatures.mean(axis=(0, 1, 2, 3, 4, 7))
sig = pd.DataFrame(sig)
sig.to_csv("signatures.txt", sep="\t")
Please let me know if there is anything that I can clarify.
Hi mishugeb,
thanks for providing me the vcf files. I ran tensorsignatures on your dataset to double-check. The way you would assess the algorithm is as follows:
As you described, start by loading the summarised hdf5
file using ts.Experiment
E = ts.Experiment('/homes/harald/research/experiments/2021.01.24_sim/results.h5')
and perform model selection using the BIC (the minimum is at rank 5).
sns.boxplot(x='rank', y='BIC', data=E.summary_table)
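Instead of reading the rank off the box plot, the same model selection can be done programmatically. This is a minimal sketch assuming, as above, that the summary table is a pandas DataFrame with 'rank' and 'BIC' columns (the toy values here are made up for illustration):

```python
import pandas as pd

# hypothetical summary table with one row per (rank, initialisation)
summary_table = pd.DataFrame({
    'rank': [4, 4, 5, 5, 6, 6],
    'BIC':  [120.0, 118.5, 101.2, 99.8, 105.3, 104.1],
})

# pick the rank whose median BIC across initialisations is smallest
best_rank = summary_table.groupby('rank')['BIC'].median().idxmin()
print(best_rank)  # → 5
```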
Then you select the rank 5 cluster, which contains multiple solutions (how many depends on how often you ran the algorithm, for init in {1..10} ...). From this cluster you want to select the best solution, i.e. the one that achieved the highest log-likelihood during optimisation; you can get it by indexing the cluster with the clu.init field.
clu = E['/MISGEHUB50/5']
init = clu[clu.init]
To inspect extracted signatures you may want to plot them
plt.figure(figsize=(16, 6))
init.plot_signatures()
which shows you extracted signatures in transcription (left) and replication (right) context (dark and light colors correspond to mutation type probabilities on coding/template and leading/lagging strand).
Now to assess the signature fit you want to compare the spectra to the original signatures. The tensorsignatures
package comes with the ts.assign_signatures
function which computes pairwise cosine distances between two signature matrices and matches the most similar signatures. In my test no signature had a cosine distance greater than 0.00832.
sigs = pd.read_csv('/homes/harald/research/experiments/2021.01.24_sim/ground.truth.syn.sigs.csv')
reference = sigs.iloc[:,2:].values
r, c, dist = ts.assign_signatures(reference, init.S[2,2,0,0,0,:,:,0])
ax = sns.heatmap(np.eye(5) * dist, cmap='Reds')
ax.set_ylabel('Reference signature')
ax.set_xlabel('Matched tensor signature')
ax.set_title('Cosine distance')
ax.set_yticklabels(r)
ax.set_xticklabels(c)
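For intuition, the matching step can be sketched in plain numpy. This is a hypothetical minimal approximation, not the actual ts.assign_signatures implementation: it normalises the columns of both (96 x k) matrices, computes pairwise cosine distances, and greedily pairs each reference signature with its closest unmatched counterpart.

```python
import numpy as np

def match_signatures(reference, extracted):
    """Greedily match columns of two signature matrices by cosine distance.

    Sketch only; the real ts.assign_signatures may differ in details.
    """
    # normalise each column (signature) to unit length
    ref = reference / np.linalg.norm(reference, axis=0, keepdims=True)
    ext = extracted / np.linalg.norm(extracted, axis=0, keepdims=True)
    dist = 1.0 - ref.T @ ext  # (k_ref x k_ext) pairwise cosine distances
    rows, cols = [], []
    remaining = list(range(ext.shape[1]))
    for i in range(ref.shape[1]):
        # closest still-unmatched extracted signature for reference i
        j = min(remaining, key=lambda jj: dist[i, jj])
        rows.append(i)
        cols.append(j)
        remaining.remove(j)
    return np.array(rows), np.array(cols), dist[rows, cols]

# toy check: the reversed identity matches column i to column k-1-i
r, c, d = match_signatures(np.eye(4), np.eye(4)[:, ::-1].copy())
```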
Notice the indexing of the signature tensor (init.S[2,2,0,0,0,:,:,0]
), this is important as the algorithm infers the spectra across different genomic contexts (the axes of the tensor encode transcription
x replication
x epigenetic states
x nucleosomal states
x clustering
x trinucleotides
x signature
x initialisation
). Indexing init.S
with [2,2,0,0,0,:,:,0]
selects the spectra across the baseline states. This gives you a (96 x 5) matrix, which is what you would also get from conventional NMF algorithms.
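To make the indexing concrete, here is a small numpy sketch with a hypothetical stand-in for init.S (the epigenetic, nucleosomal and clustering axes are reduced to two states each purely for illustration):

```python
import numpy as np

# stand-in for init.S with axes (transcription, replication, epigenetics,
# nucleosome, clustering, trinucleotide, signature, initialisation)
S = np.random.rand(3, 3, 2, 2, 2, 96, 5, 1)

# indexing the baseline state of every genomic dimension, as above,
# yields the conventional 96 x 5 signature matrix
baseline = S[2, 2, 0, 0, 0, :, :, 0]
assert baseline.shape == (96, 5)

# averaging over all non-trinucleotide axes instead (as in the earlier
# extraction steps) mixes baseline and non-baseline spectra together
averaged = S.mean(axis=(0, 1, 2, 3, 4, 7))
assert averaged.shape == (96, 5)
```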
I've uploaded my results so you can check them for yourself.
Please let me know if anything is left unclear and thank you for trying out the algorithm :) .
Thanks a lot for your help! We are trying to find out the differences between your approach and ours.
Cool, let me know if you have questions.
Hello,
I am trying to reproduce the results that you generated using the input vcfs provided by @mishugeb. Could you please clarify what parameters you used to generate results.h5.zip? I believe that initializations were from {0..51} and that epoch was set to 30000. Please let me know if there were any other parameters that were changed.
Additionally, I am trying to run the code on a GPU instance. I believe I am still having an issue because the output from $ nvidia-smi
shows a process on the GPU that has ~70MiB GPU Memory Usage but GPU-Util remains at 0%. I will post the output from the $ nvidia-smi
on #9.
Thank you!
Hi mdbarnesUCSD,
you should be able to reproduce the results above by executing
$ docker run -v $PWD:/usr/src/app/mount sagar87/tensorsignatures-data *.vcf output.h5
in the folder where you downloaded mishugeb's vcf
files to. Then run
tensorsignatures prep output.h5 final.h5
to compute the trinucleotide normalisation across genomic states. I then have a generic bash
script which runs tensorsignatures
for several initialisations and decomposition ranks (see below). You would have to adapt this for your cluster environment or AWS.
#!/bin/bash
# you can ignore the following lines as they are cluster specific
# perhaps something similar is needed to get tensorflow-gpu running on AWS
. /etc/profile.d/modules.sh
module purge
module load rhel7/default-gpu
module load python-3.6.1-gcc-5.4.0-xk7ym4l
module unload cuda/8.0
module load cuda/9.0
module load cudnn/7.0_cuda-9.0
# source the virtualenv where you have installed tensorflow-gpu, tensorsignatures etc...
source /home/hsv23/rds/hpc-work/2019.11.29_mathi/env/bin/activate
# path to the input file (output of tensorsignatures prep)
INPUT=/home/hsv23/rds/hpc-work/2021.01.24_misgehub/final.h5
MINRANK=2
MAXRANK=10
MINITER=0
MAXITER=50
# for the example I used an overdispersion of 50 (this usually works well)
k=50
EPOCHS=30000
for ((r=$MINRANK; r<=$MAXRANK; r++)); do
  for ((i=${MINITER}; i<=${MAXITER}; i++)); do
    tensorsignatures --verbose train ${INPUT} MISGEHUB_${k}_${r}_${i}.pkl ${r} -k ${k} -i ${i} -j MISGEHUB${k} -n -ep ${EPOCHS};
  done;
done
Finally combine the results
tensorsignatures write "MISGEHUB_*.pkl" results.h5
I hope this helps.
Thanks this was very helpful!
Does tensorsignatures train
have the '-n' flag set by default or does this flag need to be provided?
Yes, please set the -n
flag as it is necessary to factor in the trinucleotide normalisation during the factorisation.
I have one more question: can we align/decompose the signatures extracted by TensorSignatures against the COSMIC signatures?
Thanks
Extracted_Signatures.txt Input_vcf_files.zip ground.truth.syn.sigs.csv.zip
I hope that you are doing well!
I have a question about your tool for the analysis of mutational signatures: TensorSignatures. As part of a large-scale benchmarking of tools for the analysis of mutational signatures, we have been trying to run TensorSignatures with synthetic datasets (i.e., ones with known signatures and activities of these signatures). Unfortunately, we keep getting really poor results even on the simplest datasets with TensorSignatures (example attached: 5 signatures in 200 genomes).
I wanted to double-check that we are not making any mistake in running your tool. I have attached to this email files that include a folder containing the ground truth signatures, the exact input data used to run TensorSignatures, and the output (Extracted_Signatures.txt) of the tool. The signatures extracted by TensorSignatures match only 3 of the ground-truth signatures with a cosine similarity above 0.9. Any advice would be greatly appreciated, as we do not want to misreport the poor performance of the tool.