schneebergerlab / AMPRIL-genomes

Scripts for the assembly project of seven Arabidopsis thaliana genomes

output of pangenome #4

Open liufy11 opened 4 years ago

liufy11 commented 4 years ago

Hello! I ran the pan-genome pipeline and got results, but I don't understand what the output files mean, for example pan-genome.wga.conensus.stats, pan-genome.wga.core.stats, and pan-genome.wga.newseq.stats, or the files in tmp, such as An-1.com-2.wga.bed, tmp.An-1.C24.bed, and so on.

wen-biao commented 4 years ago

Hi,

These are the output files of the pan-genome construction based on whole-genome alignments (WGA):

pan-genome.wga.conensus.stats # for the pan-genome size

pan-genome.wga.core.stats # for the core-genome size

pan-genome.wga.newseq.stats # for the new sequence size

Since we have eight genomes, each result file contains eight lines.

Line n gives the pan-genome, core-genome, or new-sequence size when n input genomes are used (from 1 to 8 genomes).

Each line is tab-separated; each number is the size calculated for one particular combination of genomes.

The files with the prefix tmp are just intermediate files.
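
As a rough illustration of the layout described above, the following Python sketch parses the text of one of the `.stats` files into per-line lists of sizes and prints a small summary. It assumes each line is tab-separated and that the sizes are whole numbers. The names `parse_stats`, `summarize`, and `skip_first_column` are hypothetical and not part of the AMPRIL-genomes scripts, and the sample numbers are made up. If your file carries a leading non-size column, set `skip_first_column=True`.

```python
from statistics import mean


def parse_stats(text, skip_first_column=False):
    """Parse the text of a *.stats file into a list of lists of integer sizes.

    Line i (1-based) is assumed to hold the sizes computed for every
    combination of i input genomes, separated by tabs.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if skip_first_column:
            fields = fields[1:]  # drop a leading non-size column, if present
        rows.append([int(f) for f in fields if f.strip()])
    return rows


def summarize(rows):
    """Return (number_of_genomes, min, mean, max) for each line."""
    return [(i, min(r), mean(r), max(r)) for i, r in enumerate(rows, start=1)]


# Made-up numbers for three genomes, for illustration only.
example = "100\t110\t105\n120\t125\t118\t122\t121\t119\n130\t131\t129\n"
rows = parse_stats(example)
for n_genomes, lo, avg, hi in summarize(rows):
    print(n_genomes, lo, avg, hi)
```

To use it on a real file, read the file contents first, for example `parse_stats(open("pan-genome.wga.core.stats").read())`.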

liufy11 commented 4 years ago

I don't quite understand. The picture shows my result; there are only three samples in my project. What do the columns mean? Are they sizes or regions? And can I get a final pan-genome of the three samples?

wen-biao commented 4 years ago

In your case with three genomes, the last line of pan-genome.wga.conensus.stats should contain three numbers, because each genome can be selected as the reference, with the other genomes compared against it. Each number on the last line is the pan-genome size (total length of non-redundant sequence) when all three genomes are included.

If you just want one number for the pan-genome including all your genomes, just pick one of them.

liufy11 commented 4 years ago

Take pan-genome.wga.conensus.stats as an example.

(1) Why are there three columns in the first line, and what does each column mean?

(2) Why are there six columns in the second line but only three in the third?

(3) The file only gives the size of the pan-genome and doesn't show the regions the pan-genome comes from, so I can't select and merge sequences from the original three genomes to get a final pan-genome FASTA file. Is that right?

Thanks for your patient answers.

wen-biao commented 4 years ago

Again, since you have three genomes, each genome can be selected as the reference. The first line covers the case where you select just one genome to build the pan-genome, so you have three values (one for each of your three genomes). All columns except the first column indicate the pan-genome size. The second line covers selecting two genomes from your three, which gives 2*3 combinations (again, every genome can be the reference).

Yes, this only gives the size of the pan-genome, because these scripts were written for a project we did recently and are not a general method for building pan-genomes.
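
The line lengths discussed in this thread (three, six, and three values for three genomes) are consistent with counting every choice of a reference genome together with every choice of the remaining genomes. The Python sketch below enumerates those combinations. The formula n * C(n-1, k-1) is an inference from the counts mentioned here, not something taken from the AMPRIL-genomes scripts, and the function names and genome names are placeholders.

```python
from itertools import combinations
from math import comb


def reference_combinations(genomes, k):
    """Yield (reference, others) pairs for every way of picking k genomes
    with one of them acting as the reference."""
    for ref in genomes:
        rest = [g for g in genomes if g != ref]
        for others in combinations(rest, k - 1):
            yield ref, others


def expected_columns(n, k):
    """Number of values expected on line k when n genomes are available
    (assumed formula: n choices of reference times C(n-1, k-1) companions)."""
    return n * comb(n - 1, k - 1)


genomes = ["G1", "G2", "G3"]  # placeholder names
for k in range(1, len(genomes) + 1):
    combos = list(reference_combinations(genomes, k))
    print(k, len(combos), combos)
```

Under this assumption, comparing `expected_columns(n, k)` with the number of values on line k of your own file is a quick way to check that the run covered every combination.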