niuhuifei / popoolation2

Automatically exported from code.google.com/p/popoolation2

Read Groups to define populations #17

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
This isn't really an 'issue', but it is something for which the syntax is baffling 
me. The read groups are assigned as follows:

RG='@RG\tID:1\tPL:ILLUMINA\tSM:'$READ_GRP'\tDS:ref='$ASSEMBLY',pfx='$REF_PFX

Where $READ_GRP will be different for each sample, but $ASSEMBLY and $REF_PFX 
will be the same. 
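
As a quick check on what a BAM header actually declares, the ID and SM of every @RG line can be listed. This is only a minimal sketch: the function name `rg_table` and the sample names are made up for illustration, and the header is fed in with printf rather than read from a real file. The SAM specification requires each @RG line to have a unique ID, and the template above hard-codes ID:1, so it may be worth looking at the ID column after merging.

```shell
# Print "ID<TAB>SM" for every @RG line of a SAM header read from stdin.
# Field order inside the @RG line does not matter.
rg_table() {
    awk -F'\t' '
        /^@RG/ {
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/)      id = substr($i, 4)
                else if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            print id "\t" sm
        }'
}

# Hypothetical header text, standing in for the output of
# `samtools view -H merged.bam` (file name is an assumption).
printf '@HD\tVN:1.4\n@RG\tID:1\tPL:ILLUMINA\tSM:popA_1\n@RG\tID:1\tPL:ILLUMINA\tSM:popA_2\n' | rg_table
```

On a real file this would be run as `samtools view -H merged.bam | rg_table`. Repeated values in the first column mean several read groups share one ID.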

So normally I merge each population into one BAM file, then run mpileup on the 
two BAM files. In the past this worked as long as I specified no read groups 
during assembly and added them afterwards with bamaddrg. But I switched to GATK, 
which requires read groups. 

If I leave these read groups as is, popoolation does not see them as different 
samples the way it did in the past (when I just assigned read groups with 
bamaddrg). If I secondarily add read groups using bamaddrg I end up with 40 
'samples' which doesn't even make sense in terms of multiples since I have 12 
samples (2 populations of 6). 

What steps will reproduce the problem?
1. Use read groups in the original assembly as specified above
2. Either make no additional modifications or add read groups using bamaddrg
3. End up with 2 populations of 1 or 2 populations of 20 (respectively) instead 
of 2 populations of 6. 

What is the expected output? What do you see instead?

I end up with 2 populations of 1 if I add no additional read groups.

I end up with 2 populations of 20 if I add read groups with bamaddrg. 

What I actually have, and expect to see, is two populations of 6. 

What version of the product are you using? On what operating system?
popoolation2 on both osx and centos.

Please provide any additional information below.

Original issue reported on code.google.com by sasig...@ucdavis.edu on 2 May 2014 at 11:59

GoogleCodeExporter commented 9 years ago
If I change all the read groups using picard to be closer to what they were 
with bamaddrg such that they are:
@RG     ID:sample   PL:illumina     PU:G    LB:sample   SM:sample

and then merge the files with picard so that there are six read groups in each 
of the two files, popoolation somehow sees two populations of 20. I had this 
working just fine with files whose read groups were pretty much the same, but 
produced with BWA and samtools instead of BWA and GATK:

@RG     ID:sample        SM:sample

What could possibly be causing popoolation to see two populations of 20? 

Original comment by sasig...@ucdavis.edu on 5 May 2014 at 10:11

GoogleCodeExporter commented 9 years ago
And to be clear, the pileup file has 12 samples in it, and if you process it 
through to VCF in samtools it comes out as it should, so it's something about 
the process of syncing it in popoolation and how it reads the read groups.

Original comment by sasig...@ucdavis.edu on 6 May 2014 at 1:18

GoogleCodeExporter commented 9 years ago
Hmm, I'm not sure I understand your problem.
At no step in the pipeline does popoolation have to deal with read groups. I 
designed it to be modular, so read groups should be dealt with well before you 
actually start with popoolation2. mpileup2sync just translates the mpileup into 
a more convenient format, but it fully preserves the information in the mpileup. 
So the mpileup is where read groups matter. I guess this must be a problem with 
how you created the mpileup. In the mpileup you should see your two populations 
of 6. By the way, what does 'of 6' mean? Is this coverage?
If you still have the problem, please send me the head of your mpileup.
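
One rough way to check what the mpileup contains is to count its columns. This is a sketch only: it assumes the default samtools mpileup text layout (chromosome, position, reference base, then depth, bases and qualities for each sample), and the function name `mpileup_sample_count` and the demo line are made up for illustration.

```shell
# Print the distinct number of per-sample column triples found in an
# mpileup file. Assumes 3 leading columns plus 3 columns per sample.
mpileup_sample_count() {
    awk -F'\t' '{ print (NF - 3) / 3 }' "$1" | sort -n -u
}

# Hypothetical one-line mpileup with two samples, for demonstration.
demo=$(mktemp)
printf 'chr1\t100\tA\t3\t.,.\tIII\t2\t.,\tII\n' > "$demo"
mpileup_sample_count "$demo"
rm -f "$demo"
```

If this prints a single number, every line has the same number of sample columns, and that number is what any downstream conversion will see. More than one number, or a non-integer, would suggest the file does not follow the assumed layout.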

Original comment by RoKof...@gmail.com on 1 Nov 2014 at 7:21