grenaud / gargammel

gargammel is an ancient DNA simulator
GNU General Public License v3.0
Modern to ancient? #6

Closed asylvz closed 5 years ago

asylvz commented 5 years ago

Hi,

I'm trying to convert a modern genome to ancient; is it possible with Gargammel or does it only simulate an ancient genome from scratch?

The reason I need this is that I want a genome with known structural variants. I therefore create a simulated genome with a modern simulator such as Varsim (which also generates a VCF containing the known SVs), and then I want to convert this genome to ancient so that I will have the SVs of the ancient genome. Is this possible with gargammel?

Thanks, Arda

grenaud commented 5 years ago

Hi Arda! Thanks for your interest :-) Just to specify what gargammel can and cannot do: gargammel merely takes fasta references that describe 1) an endogenous source, 2) a contaminant source such as a present-day human, and 3) microbial contamination. How you generate these fasta files is up to you. What it will give you is a set of fastq files with the features of empirical data: aDNA damage, sequencing errors and adapters.

In your situation, I would convert the VCF into fasta files and use those as input to gargammel. You will then have fastq files that represent the empirical data.

Let me know if I have answered your question. Again, it depends on which features of aDNA will impact your SV detection. Is this what you want to test?

asylvz commented 5 years ago

Thanks Gabriel, it's great :) We were considering programming something like this, but it seems that we won't need to.

I already have the fastq files that harbour the SVs. Do I need some kind of preprocessing, or is it enough to put them under the endo folder?

As far as I can see, the Makefile under the exampleSeq folder is written for a specific VCF file. So do I need to modify and run that, or do you have a script to create the required files given the fastq?

grenaud commented 5 years ago

Fastq files with the SVs? Do you mean fasta (the reference representation) with the SVs? If so, then yes, please put them in the endo/ directory.

The example in exampleSeq simply uses bcftools to generate a consensus from a fasta. However, I am not sure how bcftools consensus will behave for large insertions or deletions. This is something to think about. Also, do you want the possibility of a diploid genome where one copy has the structural variant but the other does not? That is another aspect to consider.

asylvz commented 5 years ago

Actually I have both the fasta and the fastq (the reads). Anyway, I put the fasta under the endo/ folder, and cont/ contains your example data. When I run it, the reads (simulation_s1.fq.gz, simulation_s2.fq.gz) seem to be generated without problems. By the way, my fasta is diploid (the sequence names for each chromosome are listed below). ART will probably take that into account when creating reads, won't it?

1_maternal 1_paternal 2_maternal 2_paternal 3_maternal 3_paternal 4_maternal 4_paternal 5_maternal 5_paternal

It seems that the generated reads are consistent with the fasta, right? (I'll use the VCF of the fasta to check the accuracy of an algorithm that we have developed.)

One suggestion: maybe you could also add an option to create reads of varying lengths.

Thanks a lot for all your support,

Best, Arda

grenaud commented 5 years ago

What do you mean by your fasta being diploid? If you use IUPAC codes, please split them into 2 different fasta files, one maternal and one paternal.

The read length is specified on the command line, but you could take the last ART command and fiddle with it to get different read lengths. Specifying different read lengths upstream is tough: how many should be 100bp, how many 125bp, etc.?
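The "how many should be 100bp, how many 125bp" question above can be made concrete with a small sketch. The helper below is hypothetical and is not part of gargammel or ART; it only assumes that you pick relative weights for each read length yourself and then run the read simulator once per length with the resulting count.

```python
def allocate_reads(total_reads, length_weights):
    """Split total_reads across read lengths in proportion to length_weights.

    length_weights maps a read length (bp) to a non-negative relative weight.
    Both the function and its inputs are illustrative assumptions.
    """
    if total_reads < 0:
        raise ValueError("total_reads must be non-negative")
    weight_sum = sum(length_weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    # Exact (possibly fractional) share for each read length.
    exact = {length: total_reads * w / weight_sum
             for length, w in length_weights.items()}
    # Round down first, then hand the leftover reads to the lengths
    # with the largest fractional parts so the counts sum to total_reads.
    counts = {length: int(share) for length, share in exact.items()}
    leftover = total_reads - sum(counts.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for length in by_remainder[:leftover]:
        counts[length] += 1
    return counts


if __name__ == "__main__":
    # Example weights are arbitrary and only show the calling convention.
    print(allocate_reads(1000, {100: 50, 125: 30, 150: 20}))
```

Each entry of the returned dictionary would then correspond to one run of the read simulator at that length, and the resulting fastq files could be concatenated afterwards.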

asylvz commented 5 years ago

I mean that I have both the maternal and the paternal sequences in the same fasta. Do you mean I should create 2 fasta files, one for the maternal and one for the paternal sequences, and put them under endo/?

Are you sure we need this? The fasta file I use is the output of Varsim, which feeds it to its own read generation step (it uses ART too). So ART probably takes care of it.

grenaud commented 5 years ago

Yes. The README states: "Each file inside represents a genome (not simply a chromosome or scaffold)." So each file in endo/ has to represent a genome. In the case of a diploid organism, you have 2 "genomes" to sample from, a paternal and maternal.
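As a minimal sketch of the split described above, the following script separates a combined diploid fasta into one file per haplotype. It assumes the sequence names end in `_maternal` or `_paternal`, as in the names quoted earlier in this thread; the function name and the output file names (`maternal.fa`, `paternal.fa`) are hypothetical choices, not something gargammel requires.

```python
from pathlib import Path


def split_diploid_fasta(fasta_path, out_dir, suffixes=("maternal", "paternal")):
    """Write one fasta per haplotype, routing records by header suffix.

    Returns a dict mapping each suffix to the path of the file written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {s: out_dir / f"{s}.fa" for s in suffixes}
    handles = {s: open(p, "w") for s, p in paths.items()}
    try:
        current = None  # haplotype of the record being copied
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    fields = line[1:].split()
                    name = fields[0] if fields else ""
                    current = next(
                        (s for s in suffixes if name.endswith("_" + s)), None
                    )
                    if current is None:
                        raise ValueError(f"cannot assign {name!r} to a haplotype")
                if current is None:
                    continue  # ignore anything before the first header
                handles[current].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return paths


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python split_diploid.py diploid.fa endo/
    print(split_diploid_fasta(sys.argv[1], sys.argv[2]))
```

Streaming line by line keeps memory use flat even for whole-genome files, and an unrecognised sequence name raises an error instead of being silently dropped.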

ART merely generates reads from a fasta file. It does not add the idiosyncrasies of ancient DNA like heavy fragmentation, damage and presence of adapters.

grenaud commented 5 years ago

Hi Arda, do you have any further questions regarding this? If not, I will close the issue.

asylvz commented 5 years ago

Hi. I created the fastq files as you said, but I haven't been able to test them yet. You can close the issue; I don't have any questions currently. Thanks

grenaud commented 5 years ago

Hi Arda, just to clarify: the inputs for the endo, bact and cont sources have to be fasta references, not fastq files that represent reads.
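A quick way to catch the fasta/fastq mix-up before running a simulation is to look at the first non-empty line of each input file. The sketch below is a hypothetical helper, not part of gargammel; it relies only on the convention that fasta records start with `>` and fastq records start with `@`, and it opens `.gz` files transparently purely for convenience.

```python
import gzip


def sniff_sequence_format(path):
    """Return 'fasta', 'fastq' or 'unknown' based on the first non-empty line."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file


if __name__ == "__main__":
    import sys
    # Usage (hypothetical): python sniff_format.py endo/*.fa
    for name in sys.argv[1:]:
        print(name, sniff_sequence_format(name))
```

Any file reported as `fastq` or `unknown` in the endo, bact or cont directories is worth checking before starting a run.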