kbroman / qtl

R/qtl: A QTL mapping environment
https://rqtl.org
GNU General Public License v3.0
77 stars 45 forks source link

add format="mapqtl" to read.cross() #9

Closed timflutre closed 10 years ago

timflutre commented 10 years ago

Hello, some colleagues granted me access to already-published data they analyzed with JoinMap and MapQTL. As I prefer to work with free software, I started looking at R/qtl. I was wondering if there was some code in read.cross() that allows to directly load data from files in JoinMap/MapQTL formats? If no, I could maybe try to write an R function for this purpose (I say "try" because not all formats are tabulated, thus they are much harder to parse in R compare to, say, Python). Moreover, I can already say that I won't be able to write a parser able to distinguish all cases handled by JoinMap/MapQTL. In such a case, would you still be interested in integrating it into R/qtl? Otherwise, I'll write a quick-and-dirty parser for my own need. Best, Tim

kbroman commented 10 years ago

I would be happy to incorporate your code into R/qtl.

But if a Python script, say for converting JoinMap/MapQTL format to the R/qtl CSV format, is easier, you might just go with that. We could distribute it as part of R/qtl or via the R/qtl website.

karl

On Apr 11, 2014, at 8:43 AM, Timothée Flutre notifications@github.com wrote:

Hello, some colleagues granted me access to already-published data they analyzed with JoinMap and MapQTL. As I prefer to work with free software, I started looking at R/qtl. I was wondering if there was some code in read.cross() that allows to directly load data from files in JoinMap/MapQTL formats? If no, I could maybe try to write an R function for this purpose (I say "try" because not all formats are tabulated, thus they are much harder to parse in R compare to, say, Python). Moreover, I can already say that I won't be able to write a parser able to distinguish all cases handled by JoinMap/MapQTL. In such a case, would you still be interested in integrating it into R/qtl? Otherwise, I'll write a quick-and-dirty parser for my own need. Best, Tim

— Reply to this email directly or view it on GitHub.

timflutre commented 10 years ago

Ok, thanks! I'll look into that and will let you know (may take some time as I'm away next week).

timflutre commented 10 years ago

I wrote a function read.cross.mq to load data in MapQTL format. I have a data set at hand given by a colleague and I'm trying to avoid asking her to reformat the files. I'd rather prefer my code to handle things. But before sending you the files (or via a pull request) I have a few questions:

1) The data set I'm looking at doesn't have chromosomes but linkage groups. Therefore they have custom names, such as "C01fw2K", etc. Should I automatically rename them into "1", "2", etc? Moreover, should I put them all by default as autosomes in cross$geno ("A")?

2) About the phenotype file: should all strings in the file be converted into "factor" or can some remain as "character"? More specifically, should the individual identifiers be automatically converted from "character" to "factor"? Or should they even by automatically converted into integers (as they initially are in the sample file listeria_phe.csv)? For now, I have them as rownames of the data.frame cross$pheno.

3) About the phenotype file again: if no sex is specified, is it ok? For instance, the sample data file for "csvs" doesn't have any "sex" column, and thus cross$pheno$sex is fixed at "female" by default. Moreover, what does the column "pgm" correspond to? I'm asking because the file listeria_phe.csv doesn't contain any column of this name, but this column appears in the output of read.cross.

4) The data set I'm looking at corresponds to a cross in grapevine (perennial species, vegetative propagation) where both parents are highly heterozygous. Therefore there are several possible encodings for genotypes, e.g. for a locus heterozygous in both parents and with four alleles. So I don't really know how to convert such encodings into those used by R/qtl. First of all, should they be numeric?

Thanks in advance, Tim

kbroman commented 10 years ago

I wrote a function read.cross.mq to load data in MapQTL format. I have a data set at hand given by a colleague and I'm trying to avoid asking her to reformat the files. I'd rather prefer my code to handle things. But before sending you the files (or via a pull request) I have a few questions:

1) The data set I'm looking at doesn't have chromosomes but linkage groups. Therefore they have custom names, such as "C01fw2K", etc. Should I automatically rename them into "1", "2", etc? Moreover, should I put them all by default as autosomes in cross$geno ("A")?

You can leave the custom names. And yes make them autosomes.

2) About the phenotype file: should all strings in the file be converted into "factor" or can some remain as "character"? More specifically, should the individual identifiers be automatically converted from "character" to "factor"? Or should they even by automatically converted into integers (as they initially are in the sample file listeria_phe.csv)? For now, I have them as rownames of the data.frame cross$pheno.

Yes, I've been making cross$pheno a data.frame with non-numeric phenotypes as factors.

3) About the phenotype file again: if no sex is specified, is it ok? For instance, the sample data file for "csvs" doesn't have any "sex" column, and thus cross$pheno$sex is fixed at "female" by default. Moreover, what does the column "pgm" correspond to? I'm asking because the file listeria_phe.csv doesn't contain any column of this name, but this column appears in the output of read.cross.

You don't need a sex column. You only need sex and pgm if there's an X chromosome.

4) The data set I'm looking at corresponds to a cross in grapevine (perennial species, vegetative propagation) where both parents are highly heterozygous. Therefore there are several possible encodings for genotypes, e.g. for a locus heterozygous in both parents and with four alleles. So I don't really know how to convert such encodings into those used by R/qtl. First of all, should they be numeric?

R/qtl has cross type "4way" for a phase-known four-way cross. You need to know phase/haplotypes, so that the parents would be AB x CD and the offspring one of AC, AD, BC, BD. The numeric coding scheme I'm using is described in the help file for read.cross (see https://github.com/kbroman/qtl/blob/master/man/read.cross.Rd#L114-L119):

"For a 4-way cross, the mother and father are assumed to have genotypes AB and CD, respectively. The genotype data for the progeny is assumed to be phase-known, with the following coding scheme: NA = missing, 1 = AC, 2 = BC, 3 = AD, 4 = BD, 5 = A = AC or AD, 6 = B = BC or BD, 7 = C = AC or BC, 8 = D = AD or BD, 9 = AC or BD, 10 = AD or BC, 11 = not AC, 12 = not BC, 13 = not AD, 14 = not BD."

cheers, karl

timflutre commented 10 years ago

Thanks! I'm at question 4 now (converting genotype codes) but can't properly relate your explanation to the JoinMap/MapQTL manual for cases 5 to 14, sorry.

MapQTL considers 5 segregation types (table 6 in the manual):

: locus heterozygous in both parents, four alleles (no dominance allowed) : locus heterozygous in both parents, three alleles (no dominance allowed) : locus heterozygous in both parents, two alleles : locus heterozygous in one parent : locus heterozygous in other parent Then, each seg type has a range of possible genotypes (table 7 in the manual). Therefore, do cases 5 to 8 of R/qtl correspond to dominance? E.g. does "5 = A = AC or AD" mean that the A allele has to be dominant? If yes, then I guess this genotype should correspond to "h-" or "k-" in MapQTL (segregation type hkxhk). Then, to decide between "h-" and "k-", does it really matter that for R/qtl the mother has to be AB and the father CD? Or can I simply decide that, in a file in the MapQTL format, the first parent is always the mother and the second always the father? I also have questions for the remaining cases 9 to 14, but one step at a time...
timflutre commented 10 years ago

I think I got it (see below), sorry for the previous email.

Let's take the following example. In JoinMap/MapQTL, we have a marker with segregation type and phase {-1} (in parents). Thus the mother has genotype "nn" and the father has genotype "pn" (it would be "nn" and "np" if phase was {-0}). Translated into R/qtl format, the mother is "AB" and the father "CD". Thus, the offsprings can be AC (np), AD (nn), BC (np) and BD (nn). In numerics, genotypes of offsprings can be "7" (if AC or BC) or "8" (if AD or BD).

I'll now write some R code for all cases, and will thus soon be able to make a pull request.

kbroman commented 10 years ago

Hi, Tim,

I'm sorry I've not had time to think about this and give you advice.

I just remembered having received an email a while back about converting JoinMap files for use with R/qtl. You might want to take a look at Anderson et al. (2012); in particular, the supplement Text S1 (a python script within a Word doc) is, I think, trying to do what you're doing.

timflutre commented 10 years ago

Thanks for the link. I tested it on my data file but it failed and , as I'm already done with my R code, I won't dig deeper into it. How would you like me to proceed now? Should I fork your repo, create a new branch (named "mapqtl") from the current "master", and make a pull request for this new branch?

kbroman commented 10 years ago

That's right, but branch from "devel".

master is the released version (on CRAN); devel is the current state of development.

thanks! karl

On May 14, 2014, at 3:09 AM, Timothée Flutre notifications@github.com wrote:

Thanks for the link. I tested it on my data file but it failed and , as I'm already done with my R code, I won't dig deeper into it. How would you like me to proceed now? Should I fork your repo, create a new branch (named "mapqtl") from the current "master", and make a pull request for this new branch?

— Reply to this email directly or view it on GitHub.

timflutre commented 10 years ago

Ok.

Last question. When I incorporate my code into yours, and run read.cross(format="mapqtl", etc), it tells me "Warning: In summary.cross(cross) : The genetic maps should all be matrices with two rows.". This warning is only issued for 4-way crosses. What should the two rows be? For the moment, I only have one row, with the cumulative genetic distance, and with marker IDs as names:

head(out[[1]]$geno[[1]]$map) marker1 marker2 marker3 0.0 15.7 19.1

kbroman commented 10 years ago

Just give the same row repeated, with marker names as column names. It's possible to estimate sex-specific maps with 4-way crosses, in which case Female would go in the first row and male in the second.

karl

On May 14, 2014, at 6:46 AM, Timothée Flutre notifications@github.com wrote:

Ok.

Last question. When I incorporate my code into yours, and run read.cross(format="mapqtl", etc), it tells me "Warning: In summary.cross(cross) : The genetic maps should all be matrices with two rows.". This warning is only issued for 4-way crosses. What should the two rows be? For the moment, I only have one row, with the cumulative genetic distance, and with marker IDs as names:

head(out[[1]]$geno[[1]]$map) marker1 marker2 marker3 0.0 15.7 19.1

— Reply to this email directly or view it on GitHub.

timflutre commented 10 years ago

Last but not least, I am now coding a function to write cross data into the MapQTL format. It's easy for the phenotype and genetic map, but I don't see how I can export the genotype data. Where can I find the phase in the data structure for class "cross"? Moreover, if I take as an example the code "1" which corresponds to genotype "AC". It corresponds to several cases in the MapQTL format, as shown in the image below: slection_013 Sorry to bother you, but did I misunderstand something?

kbroman commented 10 years ago

I see the problem; R/qtl's data structure doesn't preserve the parental phase information.

But markers should have one of 5 patterns, according to which genotypes are observed:

5's and 6's 7's and 8's 1's, 2's, 3's, and 4's 1's, 10's and 4's 9's, 3's, and 2's

For each of these, you should be able to just pick any of the corresponding segregation patterns.

timflutre commented 10 years ago

Ok, so in my code I will always pick the first possible phase according to my table above. As I'm a bit new to this kind of models, I have no intuition whether or not the phase is used in QTL detection (conditional on the genetic map being known), and if choosing a phase arbitrarily will matter. Practically, I will check on simulated data if I find (more or less) the same results with R/qtl and MapQTL. Once it's the case, I'll be ready for a pull request.

kbroman commented 10 years ago

The different cases within one of the five distinct patterns won't make a difference; they just have to do with the labelling of alleles.

Consider <hkxhk>, {00} and <khxkh>, {11}: these are just relabellings hk.

timflutre commented 10 years ago

I am ready to commit my changes and make a pull request. However, a few days ago, I had a meeting on "intellectual property". In brief, I am the author of all the code I write, but my employer, INRA, is the owner of it. Therefore, I am supposed to write "Copyright (C) INRA" in the header of all the files I created or modified. Otherwise, the INRA has to sign an ownership transfer to the owner of R/qtl.

I am deeply aware of the insignificance of the code I'm ready to commit, and I would tend to think that there is no need to go into all this right now. However, on another side, if I want to promote free software around me (to other colleagues, etc), I need to play by the rules. Maybe it's a good occasion to see how hard it would be to ask my hierarchy to sign an ownership transfer. But even then, it may create you some problems as the University of Wisconsin may be the owner of R/Qtl instead of you. For instance, do you have the right to write "Copyright (C) Karl Broman"?

I read your post Things to avoid as a new faculty member and would much prefer not to fall into the first trap ("You may see lots of ways in which your department could be improved; don’t try to fix all of them at once"). So I would be inclined not to do anything right now, but I would be very glad if you could tell me what you think of all that.

kbroman commented 10 years ago

At the University of Wisconsin (and many universities in the US, including Johns Hopkins, where much of R/qtl was written), faculty retain ownership rights over intellectual property that we produce.

It would be good if you could get INRA to transfer ownership to you. If you then license the code with GPL-3, we can incorporate it into R/qtl, and I'll include you on the author list. Alternatively, perhaps INRA is willing to release the code under GPL-3, which would allow us to include it in R/qtl.

timflutre commented 10 years ago

Ok, I'll look into this asap. The last solution ("convince INRA to release the code under GPL-3") seems the best.

kbroman commented 10 years ago

I fixed a few things (maintaining order of chromosomes; a problem with reading phenotypes) and added tests. Seems to work, thanks!