legumeinfo / ArachisPheno

AraPheno source code for http://arapheno.1001genomes.org
MIT License

Scrub and prepare minicore data (for public version of ArachisPheno) #19

Open svengato opened 3 years ago

svengato commented 3 years ago

from Sudhansu:

The peanut pheno public data is in our DS and below is the link to the two relevant files.

The relevant DS directory: https://v1.legumefederation.org/data/public/Arachis_hypogaea/minicore.trt.JWYM/

Descriptors file: https://v1.legumefederation.org/data/public/Arachis_hypogaea/minicore.trt.JWYM/arahy.mincore.trt.JWYM.descriptors.xlsx

Data file: https://v1.legumefederation.org/data/public/Arachis_hypogaea/minicore.trt.JWYM/arahy.mincore.trt.JWYM.observations.xlsx

The Data file (observations.xlsx) has several sheets; the two sheets that have the pheno data are:

  1. Obs-mini_core sheet

These phenotype data should have two observations for each accession, one for year 2013 and the other for year 2015. I think we should be able to treat them as replications.

  2. protein_oil-mini_core sheet

The structure becomes clear after you sort on the 'accession' column. Each accession has mostly 2-3 replications. Because we are able to store replicate data in our database, we should do that to preserve the original data as much as possible.

Please let me know if we should meet to talk about the data after you go through the files.

Thanks for moving it forward quickly.

Sudhansu

svengato commented 3 years ago

  1. Obs-mini_core sheet
  2. protein_oil-mini_core sheet

Do we consider these separate studies?

What does "minicore" mean? Did the previous unpublished data set have a name?

svengato commented 3 years ago

(Sudhansu's replies)

What does "minicore" mean?:

A very minimal set of accessions that represents the variability of traits of interest in the collection of accessions.

Did the previous unpublished data set have a name?:

I believe it was the peanut core collection. It is a larger collection than the minicore and should include the minicore as a subset.

Do we consider these separate studies?:

I will come back to you on this question after I read some more information. A good question indeed. Thanks for going through the details.

sdash-github commented 3 years ago

Do we consider these separate studies?:
I will come back to you on this question after I read some more
information. A good question indeed. Thanks for going through the
details.

a.
The data for traits in both the sheets were collected from the same
replicated study in Florida.  Biochemical data, protein and oil, was
from the same plants that were used earlier in the season for
measuring other phenotype data.  So, per this information, they
should be treated as one study.

b.
If it is necessary to identify the replicates that were used for
phenotype measurements and correspond/map them with the replicates
used for oil-protein studies, then I don't have that information. If
this correspondence is necessary for statistical purposes, we may
need to treat them as separate studies and make this clear in our
metadata.  If this distinction isn't really necessary for our
database, then it is the same study.  I seem to have forgotten how
we treated the last dataset that has been kept private.

Citation: Otyama, P. I., Wilkey, A., Kulkarni, R., Assefa, T., Chu, Y., Clevenger, J., O'Connor, D. J., Wright, G. C., Dezern, S. W., MacDonald, G. E., Anglin, N. L., Cannon, E., Ozias-Akins, P., & Cannon, S. B. (2019). Evaluation of linkage disequilibrium, population structure, and genetic diversity in the U.S. peanut mini core collection. BMC Genomics, 20(1), 481. https://doi.org/10.1186/s12864-019-5824-9

svengato commented 3 years ago

The traits in the two tables do not overlap, and could therefore be merged if appropriate.

Obs-mini_core
12479 rows, one trait per row (long format), 20 numeric traits, 1 binary trait
Seed weight,Hull weight,Extra large kernel weight,Medium kernel weight,#1 kernel weight,All other seed types weight,Percentage split kernels,Percentage Sound Mature Kernel,Meat/hull ratio,Pod weight per plot,Pod weight per hectare,Plant height,Canopy width,Plant height/width ratio,Average leaflet length,Average leaflet width,Leaflet length/width ratio,Pod volume,Fancy pods,Percentage Fancy Pods,Main stem flower

These phenotype data should have two observations for each accession, one for year 2013 and the other for year 2015. I think we should be able to treat them as replications.

I can put the year in the replication name, like "PI 313129_2013_1" (or drop the _1 if there is only one per year).
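A minimal sketch of that naming rule, assuming each replicate row is available as an (accession, year) pair. The function name `replicate_names` and the input shape are hypothetical and not taken from the ArachisPheno code.

```python
from collections import Counter, defaultdict


def replicate_names(rows):
    """Build replicate names like "PI 313129_2013_1".

    rows: list of (accession, year) tuples, one per replicate row
    (a hypothetical input shape). The trailing index is dropped when
    an accession has only one replicate in a given year.
    """
    totals = Counter(rows)    # number of replicates per (accession, year)
    seen = defaultdict(int)   # running index within each (accession, year)
    names = []
    for accession, year in rows:
        key = (accession, year)
        seen[key] += 1
        base = f"{accession}_{year}"
        if totals[key] == 1:
            names.append(base)                   # single replicate: no suffix
        else:
            names.append(f"{base}_{seen[key]}")  # numbered replicate
    return names


example_names = replicate_names(
    [("PI 313129", 2013), ("PI 313129", 2013), ("PI 313129", 2015)]
)
print(example_names)
```

Numbering follows row order within each accession and year, so the input should be sorted the same way each time if the names need to be reproducible.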

protein_oil-mini_core
315 rows of 12 traits
Seed protein content,Seed oil content,Palmitic acid content,Stearic acid content,Oleic acid content,linoleic acid content,Arachidic acid content,Gadoleic acid content,Behenic acid content,Lignoceric acid content,Unsaturated fat content,Seed oleic / linoleic acid ratio

What about the 2013 and 2015 tables (sheets)?
2013-orig-mini_core: 320 rows of 18 numeric traits (and 1 binary)
2015-orig-mini_core: 320 rows of 20 traits

svengato commented 3 years ago

Do we have location data for each accession? The Obs-mini_core table says they are from Citra, FL but we should use a more precise latitude & longitude to pinpoint them on the map.

sdash-github commented 3 years ago

Hi Sven, These accessions should be a subset of the data that is in the ArachisPheno private instance and hence we should have the GIS data.

sdash-github commented 3 years ago

What about the 2013 and 2015 tables (sheets)?
2013-orig-mini_core: 320 rows of 18 numeric traits (and 1 binary)
2015-orig-mini_core: 320 rows of 20 traits

We can safely ignore these two sheets, since their data has been summarized in the Obs-mini_core sheet.

svengato commented 3 years ago

These accessions should be a subset of the data that is in the ArachisPheno private instance and hence we should have the GIS data.

I looked up some of the accessions in the other data set. Some have no coordinates, others are listed in other countries (not Florida).

sdash-github commented 3 years ago

The accessions were grown in Citra, Florida for observations, but their origins are diverse, from around the globe. The coordinates, when available, should be outside FL, depending on where the accession was originally collected.

In other words, we only need Citra, FL coordinates for the study metadata and not as the coordinates of each of the accessions.

svengato commented 3 years ago

Summary after scrubbing:

Obs-mini_core
640 rows (replicates), usually 3 replicates per accession per year, occasional missing data for some traits.

protein_oil-mini_core
315 rows, usually 3 replicates per accession (min = 1, max = 4), no year information, no missing data
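For reference, going from the long format (one trait per row) to one row per replicate could be sketched roughly as below. The keys `accession`, `year`, `rep`, `trait` and `value` are assumed names, not the actual column headers in the spreadsheet, and the numbers in the example are placeholders rather than real observations.

```python
from collections import defaultdict


def long_to_wide(records):
    """Pivot long-format records into one dict of traits per replicate.

    records: iterable of dicts with the (assumed) keys
    accession, year, rep, trait, value.
    Returns {(accession, year, rep): {trait: value, ...}}.
    Traits missing for a replicate are simply absent from its dict.
    """
    wide = defaultdict(dict)
    for r in records:
        key = (r["accession"], r["year"], r["rep"])
        wide[key][r["trait"]] = r["value"]
    return dict(wide)


# Placeholder values for illustration only.
sample = [
    {"accession": "PI 313129", "year": 2013, "rep": 1, "trait": "Plant height", "value": 30.0},
    {"accession": "PI 313129", "year": 2013, "rep": 1, "trait": "Canopy width", "value": 55.0},
    {"accession": "PI 313129", "year": 2015, "rep": 1, "trait": "Plant height", "value": 28.5},
]
wide_sample = long_to_wide(sample)
print(len(wide_sample), "replicate rows")
```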

sdash-github commented 3 years ago

Is the total number of unique accessions involved not more than 107?

svengato commented 3 years ago

Obs-mini_core: 109 unique accessions
protein_oil-mini_core: 108 unique accessions
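A count like this, plus a list of which accessions appear in only one sheet, could be produced with a small set comparison. This is only a sketch: `compare_accessions` is a hypothetical helper, and the accession labels in the example are placeholders, not names from the data.

```python
def compare_accessions(obs_accessions, protein_oil_accessions):
    """Summarize unique accessions in each sheet and the ones not shared."""
    obs = set(obs_accessions)
    po = set(protein_oil_accessions)
    return {
        "obs_unique": len(obs),
        "protein_oil_unique": len(po),
        "only_in_obs": sorted(obs - po),
        "only_in_protein_oil": sorted(po - obs),
    }


# Placeholder accession labels; repeated entries stand for replicate rows.
summary = compare_accessions(
    ["ACC_A", "ACC_A", "ACC_B", "ACC_C"],
    ["ACC_A", "ACC_B", "ACC_B"],
)
print(summary)
```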

sdash-github commented 3 years ago

I suppose >107 indicates a few commercial standards, as in the citation: "The 107 mini-core lines were replicated once in each block, along with six commercial standards in each block." And perhaps not all 107 minicore accessions had results.

svengato commented 3 years ago

Final scrubbing tasks:

  1. Decide whether Obs-mini_core and protein_oil-mini_core should be separate studies.
  2. If not, how to assign a date (year) to the protein_oil-mini_core replicates.

sdash-github commented 3 years ago

  2. If not, how to assign a date (year) to the protein_oil-mini_core replicates.

I cannot assign this without information to that effect in the spreadsheet. Does this mean we have to split the dataset into separate studies?

svengato commented 3 years ago

I am trying to determine whether a single study, in which each replicate is missing about half of the trait columns, would be a problem in ArachisPheno. Replicates like "PI nnnnnn_yyyy_k" would have the Obs traits, replicates like "PI nnnnnn_k" would have the protein-oil traits. But at least they would link to the same accession ("PI nnnnnn").
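To illustrate how both replicate name patterns would still resolve to the same accession, here is a rough parsing sketch. The regular expression and the function `parse_replicate` are assumptions for illustration only. The sketch also assumes accession names contain no underscore and that a four-digit field is always a year, so a replicate index of 1000 or more would be misread.

```python
import re

# accession, optional 4-digit year, optional replicate index
REPLICATE_RE = re.compile(
    r"^(?P<accession>[^_]+)(?:_(?P<year>\d{4}))?(?:_(?P<rep>\d+))?$"
)


def parse_replicate(name):
    """Split a replicate name into (accession, year, rep).

    year and rep are None when absent from the name.
    Raises ValueError if the name matches neither pattern.
    """
    m = REPLICATE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized replicate name: {name!r}")
    year = int(m.group("year")) if m.group("year") else None
    rep = int(m.group("rep")) if m.group("rep") else None
    return m.group("accession"), year, rep


obs_style = parse_replicate("PI 313129_2013_1")   # Obs-style name
protein_oil_style = parse_replicate("PI 313129_2")  # protein-oil-style name
print(obs_style, protein_oil_style)
```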

We could always try it this way, and split into two studies if the combined one is too cumbersome.

sdash-github commented 3 years ago

AFAIC, splitting into two different studies doesn't hurt; they are indeed meaningful subdivisions. Study-1: study of plant phenotypic/agronomic traits. Study-2: study of seed biochemical (protein/oil) traits.

Study-1: In the future, someone can do a statistical analysis with year as a block effect. This is not possible with the oil-protein data. In any case, we provide the data from which ArachisPheno builds its db, so other people can choose to analyze it their own way.

Thinking more about it, I am increasingly inclined to support separation into two studies, mainly because of the point I raised earlier: there is no correspondence between the replicates used for plant traits and those used for oil-protein traits.

svengato commented 3 years ago

The public/minicore version of ArachisPheno is now up. I added the two studies separately but with the same description and DOI.

svengato commented 3 years ago

As in issue #12, we still need to add the phenotype metadata (units, etc).