What is the meaning to the normal/tumor files for the input in somatic.pl

MenglinC commented 3 years ago

Dear Professor @wtwt5237,

I have recently been learning this pipeline and hope to apply it to our own data. However, I have some questions about the input data for somatic.pl. Why do you supply both the normal and the tumor samples at the same time? Does it mean that the tumor sample is compared with the normal sample, and the output is the mutations found in the tumor but not in the normal?

I also noticed your announcement that "For tumor-only calling, put "NA NA" in the slots of the normal samples. Results will be written to germline files". So can the tumor sample alone be used to produce the germline files? And what would be called if only the normal sample were used? I do not understand how the pair of normal and tumor samples is defined. Are they cells from normal and tumor tissue of the same patient, or cells from a healthy person and a cancer patient, respectively?

In other words, if I want to call somatic mutations in one particular tissue of a healthy person in order to trace developmental lineage, how should I input my files?

Hope for your suggestions! Thank you for your nice work! Best regards, Xiu

tianshilu commented 3 years ago

Hi Xiu,

Thank you for your interest in our somatic calling pipeline!

We input normal and tumor samples together so that the pipeline can call somatic mutations in the tumor samples and filter out the germline mutations found in both the normal and tumor samples. The normal samples therefore serve as the control. You can certainly call somatic mutations in normal tissue with the pipeline by putting your samples of interest in the slots for "tumor samples". It is better to put blood samples (or normal samples from other tissues) in the slots for "normal samples" to remove germline mutations. If you don't have such control samples, you can use tumor-only mode by putting "NA NA" in the first two slots.
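To make the two modes concrete, here is a minimal Python sketch of how the four sample slots could be filled. It is not part of the pipeline: `design_slots` and the FASTQ file names are hypothetical, and the assumption that the two tumor slots directly follow the two normal slots is mine. Only the "NA NA" convention for the first two slots comes from the explanation above.

```python
from typing import List, Optional


def design_slots(tumor_fq1: str, tumor_fq2: str,
                 normal_fq1: Optional[str] = None,
                 normal_fq2: Optional[str] = None) -> List[str]:
    """Return four sample slots: the normal pair first, then the tumor pair.

    Hypothetical helper; the slot order after the first two is an assumption.
    """
    if (normal_fq1 is None) != (normal_fq2 is None):
        raise ValueError("provide both normal files or neither")
    if normal_fq1 is None:
        # Tumor-only mode: "NA NA" fills the two normal slots.
        normal = ["NA", "NA"]
    else:
        # Paired mode: the control (e.g. blood) goes in the normal slots.
        normal = [normal_fq1, normal_fq2]
    return normal + [tumor_fq1, tumor_fq2]


# Tissue of interest goes in the "tumor" slots, even if it is not a tumor.
paired = design_slots("skin_R1.fastq.gz", "skin_R2.fastq.gz",
                      "blood_R1.fastq.gz", "blood_R2.fastq.gz")
tumor_only = design_slots("skin_R1.fastq.gz", "skin_R2.fastq.gz")

print(" ".join(paired))
print(" ".join(tumor_only))
```

The second call shows the case without a control sample: the tissue of interest still goes in the tumor slots and the normal slots hold "NA NA".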

Let me know if you have any other questions.

Tianshi

MenglinC commented 3 years ago

Dear @tianshilu

I am very excited to receive your reply! Your explanation of the input files for somatic.pl cleared things up for me at once. Thanks a lot!

Xiu

MenglinC commented 3 years ago

Hi Tianshi @tianshilu, I also have some questions about the script job_somatic.pl. The usage and the example command you gave are as follows:

perl /Directory/to/folder/of/code/job_somatic.pl design.txt example_file thread build index java17 disambiguate_pipeline

perl ~/somatic/job_somatic.pl somatic_design.txt ~/somatic/example/example.sh 32 hg38 ~/ref/hg38/hs38d1.fa /cm/shared/apps/java/oracle/jdk1.7.0_51/bin/java 0 2 ~/disambiguate_pipeline

However, I have three questions about this:

(1) Where is your example.sh file? I cannot find it in the example folder you mentioned. From reading your source code, does it contain only one line, "JOBSTART"?

(2) What is the meaning of the "2" near the end of the example command? Can it simply be understood as the number of lines in somatic_design.txt?

(3) What is the biological meaning of disambiguate in your pipeline?

Hope for your suggestions! Thanks!

Xiu

tianshilu commented 3 years ago

Hi @MenglinC ,

(1) example.sh is a bash file for submitting batch jobs on your system. In our case, we use the SLURM system for submitting jobs. You can create your own example file based on your system. I uploaded one example.sh under the example folder for your reference.

(2) "2" means two jobs per submission. In other words, two lines of somatic_design.txt are submitted together each time.

(3) Disambiguate.py is for disambiguating human and mouse reads for patient-derived xenograft (PDX) samples.
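As an illustration of point (2), the grouping of design-file lines into submissions can be sketched in Python as below. This is not the code in job_somatic.pl: `batch_design_lines` and the placeholder rows are hypothetical, and the handling of a final, smaller batch is an assumption.

```python
from typing import Iterator, List


def batch_design_lines(lines: List[str],
                       jobs_per_submission: int) -> Iterator[List[str]]:
    """Yield groups of design-file lines, one group per submitted script.

    Hypothetical helper that mirrors the "two lines per submission" idea.
    """
    if jobs_per_submission < 1:
        raise ValueError("jobs_per_submission must be at least 1")
    # Ignore blank lines so they do not count as jobs.
    rows = [line for line in lines if line.strip()]
    for start in range(0, len(rows), jobs_per_submission):
        yield rows[start:start + jobs_per_submission]


# Placeholder rows standing in for the lines of somatic_design.txt.
design = ["row for sample A", "row for sample B", "row for sample C",
          "row for sample D", "row for sample E"]

batches = list(batch_design_lines(design, 2))
for number, batch in enumerate(batches, start=1):
    print(f"submission {number}: {len(batch)} job(s)")
```

With five rows and a value of 2, this sketch produces three submissions, the last holding a single job.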

Thanks!

MenglinC commented 3 years ago

Hi @tianshilu,

Thank you for your past help. I have recently run somatic.pl successfully on the mtDNA example data that you provided. However, I have run into another problem: how can I verify the accuracy of my results? The pipeline produces three files:

coverage.txt
germline_mutations_hg19.txt
somatic_mutations_hg19.txt

I tried to compare the result files in the folder /QBRC/example/example_dataset/example_output/ that you provided with those in /QBRC/example/example_dataset/output/. They turn out to be quite different, which makes me question the reliability of my processing. Could you give me some suggestions on how to verify the reliability of our results? Thank you very much!

Best regards, Xiu

tianshilu commented 3 years ago

Hi @MenglinC ,

Thank you for your interest in the somatic mutation calling pipeline. The suffix of your output files is hg19, so I guess the reference genome you chose is hg19. As you can see, the suffix of the example files is hg38, because I used hg38 as the reference for the example. A different reference genome can lead to quite different results.
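A quick check along these lines can be scripted before comparing any two result files. The Python sketch below reads the build from the file name suffix and refuses to treat files from different builds as comparable. `build_of` and `same_build` are hypothetical helpers, and the hg38 file name is inferred from the description above, not copied from the example folder.

```python
import re
from typing import Optional

# Matches names such as "somatic_mutations_hg19.txt" and captures "hg19".
BUILD_SUFFIX = re.compile(r"_mutations_([A-Za-z0-9]+)\.txt$")


def build_of(filename: str) -> Optional[str]:
    """Return the build named in the file suffix, or None if there is none."""
    match = BUILD_SUFFIX.search(filename)
    return match.group(1) if match else None


def same_build(mine: str, reference: str) -> bool:
    """True only when both names carry the same build suffix."""
    build_a, build_b = build_of(mine), build_of(reference)
    return build_a is not None and build_a == build_b


my_file = "somatic_mutations_hg19.txt"
example_file = "somatic_mutations_hg38.txt"  # assumed name of the example output

if not same_build(my_file, example_file):
    print(f"builds differ: {build_of(my_file)} vs {build_of(example_file)}")
```

Matching builds are only a precondition for a meaningful comparison; they do not by themselves show that two runs agree.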

Thanks! Tianshi

MenglinC commented 3 years ago

Hi @tianshilu ,

Thank you for your quick reply! I have another question on this issue. Other than the reference genome, are there any factors that can greatly affect the results of this pipeline and that I should pay attention to in my own project? I would also like to know how I can verify the mutation calling results, because there are many uncontrollable factors in this procedure. How can I demonstrate the reliability of my results? Have you ever had the same concern? I would appreciate any suggestions. Thanks a lot! Xiu

MenglinC commented 3 years ago

Hi @tianshilu ,

I am sorry to trouble you again with one more question. I noticed that in somatic.pl you use Mutect 1.1.7 as one of the variant callers for RNA-seq data. However, the papers I have read recently generally use Mutect2 to call somatic mutations. What is the reason behind the choice of this Mutect version?

Thanks very much! Xiu

MenglinC commented 3 years ago

Hi @tianshilu ,

I have now moved on to running cnv.pl. Is this program perhaps only suitable for exome-seq? Can I use the BAM files produced by somatic.pl when processing single-cell RNA-seq data?

Xiu

tianshilu / QBRC-Somatic-Pipeline

What is the meaning to the normal/tumor files for the input in somatic.pl #8