nicolazzie / AffyPipe

an open-source pipeline for Affymetrix Axiom genotyping workflow on livestock species
13 stars 7 forks source link

Best Practice pipeline crashing #7

Open jcgrenier opened 8 years ago

jcgrenier commented 8 years ago

Hello Dr. Nicolazzi,

I'm trying to use your pipeline presently for some of our arrays that we just got. I have multiple issues but here's one of them in particular.

When I'm trying to run the "BestPractice" pipeline, it seems that the program is crashing at the place where your program is computing the PlateQC metrics. Here's the error that I'm getting :

Traceback (most recent call last): File "$affyPipeDir/AffyPipe.py", line 478, in PlatePR=round(100.*float(c_plates[plate[0]])/float(plate[1]),2) KeyError: 'A12_CPTP_UKBB_BCGP48_P001_BCG000033'

Note, that the name of the CEL file is not complete there, it's missing 4 characters. Is there a limit to the number of characters for this variable?

Here's the command line that I used :

python AffyPipe.py -t $tooldir -d 0.82 -c 0.97 -l 0.95,0.985 -a $aptdir/apt-1.18.0-x86_64-intel-linux -s $snpPolisherDir/SNPolisher_package -o $outputDir --plinkACGT -y --debug -b mycellistfile.txt

Furthermore, when comparing with the normal Affymetrix program (Axiom_ Analysis_Suite), I'm getting pretty different SNPQC measures (calling rate much lower for AffyPipe), resulting in multiple samples that are getting trashed that were not with the other software.

I'm also loosing all the positions coming from the mitochondria and the Y chromosome (probably some positions coming from the nonPAR region of the X chromosome too..). I modified the python script to get the Hemizygous positions now, but I still have that bug, so I can't tell for now if I'm recovering all of them.

Thanks for helping me.

JC

nicolazzie commented 8 years ago

Dear JC, Sorry for thelatereply. It is difficult to debug this remotely without more information but I'll give it a try. My best guess for the first problem is that, reading the command you used to launche the program assumes that recognition of plates (required for the best practice) is left to Affypipe. However, seeing the names of your cel file, this is probably not possible (no common pattern plate-Wise). My suggestion is to create a file with plates (the indications of how to do that are in the readme, but essentially this is a list of celfiles and plates ids, commaseparated), amd include that information to the command line. In any case the program is giving a nasty non-comprehensible error mesage, and I should (and will) fix that.

As for the second problem, that is quie odd... And worrying. Affypipe was double-checked many times against affymetrix genotypingconsole (affy's windows pgms) always giving the same results! Actually we're only streamlining affy's programs, not doing anything on our own! Could you send me an example of this? Couldit be (and I'm just guessing here, that you used different datasets or different library files (versions)? I really would like to go to the bottom of this.. Today I'll not be "unplugged" from the net for the whole day, but please send me as much info you can, also directly to my email (consideing these files get huge in no time, using google drive? ) My gmail address is ezequielluis.nicolazzi@gmail.com)

Hope to be able to sort this out for you quickly. Best, ezequiel

jcgrenier commented 8 years ago

Hello Ezequiel,

Thanks for the reply. I began by testing the first solution, which was to create a file with the plate information. That worked perfectly.

Have you ever worked on the Axiom_UKB_WCSG.r3 array? I'm working with that one. Another issue that I saw, was that when I'm letting the files like the one given by affymetrix directly, I saw that in this file : Axiom_UKB_WCSG.r3.step1.ps, there are only 20000 probes, and that it's those ones that are reported in the end in my plink files.

So I replaced this file with this one : Axiom_UKB_WCSG.r3.BiallelicPlusUnsupportedMultiallelic.ps, which had the highest amount of probes in all of the zipped files. Did you ever had that problem?

Thanks

nicolazzie commented 8 years ago

Great! I will modify the script in order to force users adding the file with plates when running the bestpractice procedure. That will fix the "bug". Thanks for the feedback.

As for your question, my best guess is that the .ps file is the only way Affy has to distinguish between probes that work and those that don't. It's cheaper than re-do the array... just using that file they decide what probes to get out of the array. Nothing like that happened in any of the species I worked with or, to say it better, probably that intermediate step was already taken before I started working with the data.

What about the different results? Did the plate file inclusion fixed that also?

Please do not hesitate to contact me if you find any oddities.. hopefully not! Ez.

jcgrenier commented 8 years ago

Hello Ezequiel,

Sorry for the very late reply. I was mainly running my test on the windows program by now. I'm still curious why it's not working properly thought, as it will be really more easy for me to automatize all the arrays that I will get.

So with my new test, I used the reference files downloaded with the windows program, to be sure I have the exact same set. Using this, I'm getting the same DQC values, but, without changing the Axiom_UKB_WCSG.r3.step1.ps file, I'm getting in the end only a set of plink files with ~18k markers (which are probably coming from the 20k markers used in the 1st SNPQC step. The calling rates values that I'm getting for the first step are also different from the ones I got with the windows program.

The "recommended.ps" file that I'm getting after the SNPolisher step is also pretty small, only containing 5.5k markers. That was the file that they were recommending us to use with the windows program for the final file output.

How do I get the complete set of marker in my plink files in the end? Do I need to play with that Axiom_UKB_WCSG.r3.step1.ps file or is there another way to specify the complete list of snps?

Thanks for your help,

Best, JC

nicolazzie commented 8 years ago

Hi JC, it's difficult to troubleshoot this without some try-and-error on my side. Probably sending me (just 1 would do) a raw starting file and letting me know which libraries (annotation and analyses files) are you using I can be much more effective in helping you. This said, also having the logs and the parameter file you're using can help me help you. The fact that you get a plink file with 18k SNPs makes me think that you're using Axiom_UKB_WCSG.r3.step1.ps as the annotation file in the parameter file? That might not be the best of the ideas. AffyPipe (actually Affy's original progs) know what to search for, and they do that automatically. The annotation file you should input in your param file should be the TOTAL one, as this is used only by the R script. It's puzzling however that DQC values are identical but not call rates... is this for the FIRST step call rate or for the second one? Are you talking about individual call rate or plate call rate? Are you talking about just "call rate" or the value of "total call rate". I'd love to help you figuring this out, to know if this is an Affy software problem, my software problem, or just a problem in the parameters given to the program... let me know! Best, ezequiel