Open svengato opened 3 years ago
- Obs-mini_core sheet
- protein_oil-mini_core sheet
Do we consider these separate studies?
What does "minicore" mean? Did the previous unpublished data set have a name?
(Sudhansu's replies)
What does "minicore" mean?:
A very minimal set of accessions that represents the variability of traits of interest in the collection of accessions.
Did the previous unpublished data set have a name?:
I believe it was the peanut core collection. It is a larger collection than the minicore and should include the minicore as a subset.
Do we consider these separate studies?:
I will come back to you on this question after I read some more information. A good question indeed. Thanks for going through the details.
Do we consider these separate studies?:
a. The data for traits in both sheets were collected from the same replicated study in Florida. The biochemical data (protein and oil) came from the same plants that were used earlier in the season for measuring the other phenotype data. So, per this information, they should be treated as one study.
b. If it is necessary to identify the replicates that were used for phenotype measurements and map them to the replicates used for the oil-protein studies, then I don't have that information. If this correspondence is necessary for statistical purposes, we may need to treat them as separate studies and make this clear in our metadata. If this distinction isn't really necessary for our database, then it is the same study. I seem to have forgotten how we treated the last dataset that has been kept private.
Citation: Otyama, P. I., Wilkey, A., Kulkarni, R., Assefa, T., Chu, Y., Clevenger, J., O'Connor, D. J., Wright, G. C., Dezern, S. W., MacDonald, G. E., Anglin, N. L., Cannon, E., Ozias-Akins, P., & Cannon, S. B. (2019). Evaluation of linkage disequilibrium, population structure, and genetic diversity in the U.S. peanut mini core collection. BMC Genomics, 20(1), 481. https://doi.org/10.1186/s12864-019-5824-9
The traits in the two tables do not overlap, and could therefore be merged if appropriate.
Obs-mini_core
12479 rows, one trait per row (long format), 20 numeric traits, 1 binary trait
Seed weight,Hull weight,Extra large kernel weight,Medium kernel weight,#1 kernel weight,All other seed types weight,Percentage split kernels,Percentage Sound Mature Kernel,Meat/hull ratio,Pod weight per plot,Pod weight per hectare,Plant height,Canopy width,Plant height/width ratio,Average leaflet length,Average leaflet width,Leaflet length/width ratio,Pod volume,Fancy pods,Percentage Fancy Pods,Main stem flower
These phenotype data should have two observations for each accession, one for year 2013 and the other for year 2015. I think we should be able to treat them as replications.
I can put the year in the replication name, like "PI 313129_2013_1" (or drop the _1 if there is only one per year).
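A minimal sketch of that naming scheme, assuming each row of the sheet can be reduced to an (accession, year) pair. The function name `replicate_names` and the input shape are illustrative assumptions, not part of the actual scrubbing script.

```python
from collections import Counter, defaultdict


def replicate_names(rows):
    """Build one replicate name per (accession, year) row, in input order.

    The "_k" suffix is dropped when an accession has only one row in a year.
    """
    totals = Counter(rows)    # how many rows each (accession, year) pair has
    seen = defaultdict(int)   # running replicate index per pair
    names = []
    for accession, year in rows:
        key = (accession, year)
        seen[key] += 1
        if totals[key] == 1:
            names.append(f"{accession}_{year}")
        else:
            names.append(f"{accession}_{year}_{seen[key]}")
    return names


# Example: two 2013 rows and one 2015 row for the same accession.
print(replicate_names([("PI 313129", 2013), ("PI 313129", 2013), ("PI 313129", 2015)]))
```

Keeping the suffix on every name, even when there is only one replicate, would make the names easier to parse later, at the cost of slightly longer labels.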
protein_oil-mini_core
315 rows of 12 traits
Seed protein content,Seed oil content,Palmitic acid content,Stearic acid content,Oleic acid content,linoleic acid content,Arachidic acid content,Gadoleic acid content,Behenic acid content,Lignoceric acid content,Unsaturated fat content,Seed oleic / linoleic acid ratio
What about the 2013 and 2015 tables (sheets)?
2013-orig-mini_core: 320 rows of 18 numeric traits (and 1 binary)
2015-orig-mini_core: 320 rows of 20 traits
Do we have location data for each accession? The Obs-mini_core table says they are from Citra, FL but we should use a more precise latitude & longitude to pinpoint them on the map.
Hi Sven, These accessions should be a subset of the data that is in the ArachisPheno private instance and hence we should have the GIS data.
What about the 2013 and 2015 tables (sheets)? 2013-orig-mini_core: 320 rows of 18 numeric traits (and 1 binary) 2015-orig-mini_core: 320 rows of 20 traits
We should conveniently ignore these two sheets as data in these has been summarized in the Obs-mini_core sheet.
These accessions should be a subset of the data that is in the ArachisPheno private instance and hence we should have the GIS data.
I looked up some of the accessions in the other data set. Some have no coordinates, others are listed in other countries (not Florida).
The accessions were grown in Citra, Florida for observations, but their origins are diverse, from around the globe. The coordinates, when available, should be outside FL, depending on where the accession was originally collected.
In other words, we only need Citra, FL coordinates for the study metadata and not as the coordinates of each of the accessions.
Summary after scrubbing:
Obs-mini_core
640 rows (replicates), usually 3 replicates per accession per year, occasional missing data for some traits.
protein_oil-mini_core
315 rows, usually 3 replicates per accession (min = 1, max = 4), no year information, no missing data
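Counts like these can be reproduced with a short check, sketched here under the assumption that the accession column of a sheet has been read into a plain list with one entry per replicate row. The function name `replicate_summary` and the placeholder labels in the example are hypothetical.

```python
from collections import Counter


def replicate_summary(accessions):
    """Summarize replicate counts for a list with one accession ID per row."""
    counts = Counter(accessions)              # rows per accession
    count_frequency = Counter(counts.values())  # how often each count occurs
    return {
        "unique_accessions": len(counts),
        "min": min(counts.values()),
        "max": max(counts.values()),
        "most_common": count_frequency.most_common(1)[0][0],
    }


# Placeholder labels, not real accession IDs.
rows = ["A"] * 3 + ["B"] * 3 + ["C"] * 1 + ["D"] * 4
print(replicate_summary(rows))
```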
Is the total number of unique accessions involved no more than 107?
Obs-mini_core: 109 unique accessions
protein_oil-mini_core: 108
I suppose >107 indicates a few commercial standards, as in the citation: "The 107 mini-core lines were replicated once in each block, along with six commercial standards in each block." And perhaps not all 107 minicore accessions had results.
Final scrubbing tasks:
- If not, how to assign a date (year) to the protein_oil-mini_core replicates.
I will not be able to assign this without any information to that effect in the spreadsheet. Does this mean we have to split the dataset into separate studies?
I am trying to determine whether a single study, in which each replicate is missing about half of the trait columns, would be a problem in ArachisPheno. Replicates like "PI nnnnnn_yyyy_k" would have the Obs traits, replicates like "PI nnnnnn_k" would have the protein-oil traits. But at least they would link to the same accession ("PI nnnnnn").
We could always try it this way, and split into two studies if the combined one is too cumbersome.
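To illustrate how both replicate forms could link back to the same accession, here is a minimal parsing sketch. It assumes the replicate index is always present, that accession IDs contain no underscores, and that a four-digit field before the index is a year; the names `REPLICATE` and `parse_replicate` are hypothetical, and nothing here reflects how ArachisPheno itself handles replicate names.

```python
import re

# "<accession>_<yyyy>_<k>" (Obs traits) or "<accession>_<k>" (protein-oil traits).
REPLICATE = re.compile(r"^(?P<accession>[^_]+)(?:_(?P<year>\d{4}))?_(?P<rep>\d+)$")


def parse_replicate(name):
    """Split a replicate name into (accession, year or None, replicate index)."""
    match = REPLICATE.match(name)
    if match is None:
        raise ValueError(f"unrecognized replicate name: {name!r}")
    year = match.group("year")
    return (
        match.group("accession"),
        int(year) if year else None,
        int(match.group("rep")),
    )


print(parse_replicate("PI 313129_2013_1"))  # Obs-style replicate
print(parse_replicate("PI 313129_2"))       # protein-oil-style replicate
```

Under these assumptions, a name with the index dropped (such as "PI 313129_2013") would be read as a year-less replicate with index 2013, so the combined study would need the index kept on every name.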
AFAIC, splitting into two different studies doesn't hurt; they are indeed meaningful subdivisions. Study-1: study of plant phenotypic/agronomic traits. Study-2: study of seed biochemical (protein/oil) traits.
Study-1: In the future, someone can do a statistical analysis with year as a block effect. But this is not possible with the oil-protein data. Anyway, we provide the data from which ArachisPheno builds its db, so other people can choose to analyse it their way.
Thinking more about it, I am getting more inclined to support separation into two studies. This is more so because of the point I raised earlier that there is no correspondence between the replicates used for plant traits and oil-protein traits.
The public/minicore version of ArachisPheno is now up. I added the two studies separately but with the same description and DOI.
As in issue #12, we still need to add the phenotype metadata (units, etc).
from Sudhansu: