Closed sdash-github closed 4 years ago
File for loading:
file: phenotype_data_V1.sdWorkingCopy.csv
268258 Mar 30 13:48:43 2020 phenotype_data_V1.sdWorkingCopy.csv
Do the accession and replication_id have any external significance, or can I replace them with integers? (I think ArachisPheno expects integers.)
ADF: The accessions here are significant. You'll probably need to load their names into some table and then sub the integers it assigns them into a copy of the file. I think the replicate ids are just made up.
SG Right, I can make the accessions a table column corresponding to an arbitrary integer id. So far, I am having problems adding non-numerical values (experimenting with a smaller test file).
SD If that is an absolute requirement, we don't have a choice! But we would search and do our operations by our accession numbers rather than by their accession_id. I want to preserve both our accession numbers and our replication_id in the db. One choice is have another look up table to our accession-no and rep-id to their acc-id with foreign key relation defined. But before you do this I need to look at AraPheno and see how they are handling their operations involving the genotypes aka accessions. Also I am going to create a new issue and carry on this conversation rather than the emails. Sudhansu
Do our accessions have fields like latitude, longitude, country, species, collector, date (like the AraPheno examples)?
We should have the data at least for many of them. But it will take some time to look for them and compile them. We serve that data from our lis_germplasm module: https://peanutbase.org/germplasm/map
So if we have a list of accessions we should be able to pull out this passport data.
I guess I basically wrote the needed script when Carolina Chavarro was in touch earlier this year. presumably this will be an easier case with all accessions representing regular PI accession numbers and not a hodge-podge of different identifiers?
one other comment: it might be worth at least considering whether anything in the code actually depends on the primary key being an integer (it's convenient when the db is assigning them, but it's not clear to me at this point that it needs to be done this way). I am personally not a big fan of URLs like: https://arapheno.1001genomes.org/accession/6932/ which make it harder to link in from external systems that know it as "Ler-1" that said, it's probably not worth any major heroics, esp. as one could always do some sort of redirection magic for https://arapheno.1001genomes.org/accession/Ler-1 to work as desired
It looks like we have to add accessions before phenotypes anyway, otherwise uploading a phenotype throws an error like "Unknown accession with ID: 42". We could therefore:
Sven, Is it possible for you to generate relations diagram of the tables in the db, a schema diagram. This will help me see things a little clearly. Also, where in the dev-arachispheno the sqlite file is located?
It is in ~/ArachisPheno/arapheno/db.sqlite3 Does a tool for generating such a diagram already exist?
I had assumed you might know some tools like schema-scrawler, etc. http://www.tobias-weis.de/short-schema-diagram-from-an-existing-sqlite-database/ https://www.schemacrawler.com/index.html https://github.com/schemacrawler/SchemaCrawler
But if it takes too much time, efforts to get it installed somewhere and get the entity diagrams, then you may decide for or against it. I will eventually look at the sqlite db and see for myself how they have organized the phenotype info in AraPheno.
I printed out the schema and will send it to you.
db.pdf Attaching Sven generated schema diagram (above).
There appear to be 2441 accessions in the PeanutBase lis_germplasm dataset, though not all of them use a "PInnnnnn" identifier. Andrew, you said there is a script for extracting those?
I am in the middle of preparing the passport data for our acc.
I have kept this file, phenoDataV1_passportdata.txt, in dev-arachispheno in my home dir. I think you can access it, at least with sudo, else let me know.
This hasn't been formatted for the fields in the db table. Please let me know if you want that way and I will start formatting with those header names.
File stats : Total rows including replicated lines = 1007 Unique acc = 794 Acc with replications = 107 (uniq -d) # all have 3 reps except one acc with 2 reps (320 rows) Acc without replication = 687 (uniq -u)
Got it, thanks. Looking at it now -
FYI. The passport data file is ';' separates as some text might have coma within.
taxon;genus;species can be merged into species (code 1 for hypogaea, 2 for duranensis, 3 for ipaensis if any). accenumb -> name (also, it appears twice - maybe one is accession_id ?) origcty -> country collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?) latitude;longitude;elevation -> not necessary latdec;longdec -> latitude;longitude geographic_coord -> not necessary (unless we can use it to determine latitude and longitude for those whose values are 0?)
If there is no collection_date, collector, cs_number (= stock center number), those will be blank (which is okay).
Also, is this set of accessions considered a "genotype"? (for ArachisPheno purposes) If so, is it of type SNP chip, Full sequence, Imputed full sequence, xor RNA sequence?
The accessions are more actually genetic lines. We do not have genotype data yet. I would not worry about filling anything related to genotype data in the db now.
On 2020/3/31 1:13 PM, svengato wrote:
Also, is this set of accessions considered a "genotype"? (for ArachisPheno purposes) If so, is it of type SNP chip, Full sequence, Imputed full sequence, xor RNA sequence?
as far as I can tell, arapheno's concept of genotype is simply whether or not a given accession has data (somewhere other than arapheno) with respect to a given platform. See the filter at the top of https://arapheno.1001genomes.org/accessions/ with the dropdown of choices.
we could probably ignore it or perhaps repurpose it slightly to indicate membership of the accession in some designated set (e.g. "core" or "minicore"). At the end of the day, it seems to do nothing more than provide a convenient subsetting mechanism (unless I'm missing something)
Yes, we can ignore it for now.
On 2020/3/31 1:08 PM, svengato wrote:
taxon;genus;species can be merged into species (code 1 for hypogaea, 2 for duranensis, 3 for ipaensis if any).
I think you mean the 'phenotype_species id'.
accenumb -> name (also, it appears twice - maybe one is accession_id ?) origcty -> country collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?)
But this field doesn't have the Lat/Lon values ? they are the next two fields in the data file.
latitude;longitude;elevation -> not necessary latdec;longdec -> latitude;longitude geographic_coord -> not necessary (unless we can use it to determine latitude and longitude for those whose values are 0?)
We may not use this for now.
If there is no collection_date, collector, cs_number (= stock center number), those will be blank (which is okay).
We may not use this for now.
I think you mean the 'phenotype_species id'.
Correct.
collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?)
But this field doesn't have the Lat/Lon values ? they are the next two fields in the data file.
Right, I just want to shorten the string.
I spent some time experimenting with the REST API to confirm that it uses the [accession / phenotype / study] id rather than name, and replaced references to arapheno.1001genomes.org with our URL (in a file hidden from GitHub), so that the REST API will work properly. (Still to do: how to embed it in the HTML files that describe the REST API.) I also wrote a script to recreate a pristine database (with no accessions, phenotypes, or studies) if necessary.
Now I am reviewing all the details before converting the accession id to a string, after which I can rebuild the database and try to load the accessions file (or a converted version of it). The main questions are
Will we need to display or use the replicate id from the phenotypes file? I think AraPheno ignores that field.
We most likely won't use the rep_id for querying; but when we query in a way that the data is reported for each of the 3 reps for some accessions, the rep id should be preserved, so that we know which of the reps certain data refers to in case there is a question and should be traceable to the original data file. That is my purpose of preserving the rep-id in the db.
I converted your accessions CSV file to a JSON file (there was one quoted field in a collection site that took a while to track down). It successfully imported in a test, using an integer id.
Current question: if the phenotype's accession id is a string, does the accession's own id have to be a string as well? (so they can match)
Current question: if the phenotype's accession id is a string, does the accession's own id have to be a string as well? (so they can match)
I am not able to wrap my head around "the phenotype's accession id" and the "accession's own id" . I think @adf Andrew can catch it and respond.
Just saw the phenotypes coming up in the webpage. Congratulations! http://dev.lis.ncgr.org:50007/accessions/?page=1
I'm not sure I understood the question; if talking about a foreign key -> primary key relationship I think they'd have to match- probably a bad idea to do type-casting in that context, even if permitted.
Nice to see the collection site in the map. http://dev.lis.ncgr.org:50007/accession/774/
Just saw the phenotypes coming up in the webpage. Congratulations!
Those are the accessions, I still have to add the phenotypes. One thing I can do tonight is try to upload your phenotype CSV file, using integer ids, possibly without the non-numeric columns.
Nice to see the collection site in the map. http://dev.lis.ncgr.org:50007/accession/774/
I know I will regret making this comment, but: if judged solely by the street names given in that particular map example, one might well be led to believe that peanuts really are the musical fruit... ;)
Sitename: Village market, Mercado La Union, Trujillo
And nice little village market with bus services, a university and lots of suburbs !!
if judged solely by the street names given in that particular map example, one might well be led to believe that peanuts really are the musical fruit... ;)
The BBC could use the map to send Michael Palin on a series of Great Legume Journeys!
Chopin and Liszt comics - http://www.harkavagrant.com/index.php?id=302
Anyway, I tried to upload a modified version of the phenotype file, with the two non-numeric columns removed, integer ids matching those in the accessions file, "na" replaced with blanks, and '/' and '.' replaced with '_' in the column names. But it gives an error: "could not convert string to float: #DIV/0!"
There were some literal "#DIV/0!" in the file, so I replaced them with blanks and tried again. It accepted the submission, http://dev.lis.ncgr.org:50007/submission/ad6dc3b3-2c27-4603-976b-2269b33750cd/
It also appears to be trying to send me an e-mail message about the submission status, though it has not yet arrived. The sender (From field) is Ümit (from AraPheno), not me, so the e-mail server is probably incorrectly configured too.
I went to the admin page and set its status to Published, but without curating it (setting the ontology fields, etc). Now you can view the phenotypes under Public Phenotypes (and the study, under Public Studies).
Summary: the user uploads a phenotype table, then the admins/curators add relevant information and update its status. Once published, the submission becomes a Study.
Also confirmed that it is possible to go back and curate it (change status from Published to Curation).
looks like good progress for goober peas; a couple of things I noticed:
Inconsistency in the number of accessions: This file (study) has Unique acc = 794. The page with this link http://dev.lis.ncgr.org:50007/phenotype/5/ shows 'Geographic distribution of 876 accessions' above the world map. But the table view in the same page has correct number of accessions: 793. Just for records for now. This is probably not the time to solve all these problems.
link to AraGWAS Associations:AraGWAS:3 probably needs to be suppressed.
Removed this. Also noticed that DOI citations are incorrect, the DataCite link is still set to what it was for AraPheno.
876 includes all replicates with a non-blank value for that phenotype.
At some point in the near future we should change the code for this figure to reflect the number of accessions, rather than the count based on the number of replicates. In the map context it doesn't make much sense.
On 2020/4/1 9:58 AM, svengato wrote:
876 includes all replicates with a non-blank value for that phenotype.
I am personally not a big fan of URLs like: https://arapheno.1001genomes.org/accession/6932/
I am getting close to testing the change in accession id from integer to character. Just to confirm, this is only for the accession (and replicate) ids? Not phenotypes or studies, which may have a longer name?
Yes, URLs like https://arapheno.1001genomes.org/accession/IPnnnnn/ would very user friendly. Accessions and rep-ids (preferably replicate name like IP12345_2) would be great.
Now that you are giving me to choose, it will be nice to have a phenotype name as a URL parameter like: https://arapheno..........org/phenotype/?study=2&phenotype=leaf_width I am not worried about study name now.
Thanks for bringing these options up.
Let me think about it - what if two studies have a phenotype with the same name? There might be a conflict when you submit the second study.
if two studies have a phenotype with the same name? It is more likely this would be the case. For example leaf_width will be likely the same name in another study. The combination of the particular study and the phenotype would make it unique.
I looked in AraPheno and found an example: two studies have a phenotype called As75, but they have different ids and are treated separately. Other examples: Ca43, Co59, Cu65.
found an example: two studies have a phenotype called As75, but they have different ids Superficially, without having seen the details in the schema, it sound like not a well normalized schema, at least this part.
I want to re-think what I just said about reuse of penotype-id in multiple studies. We might already be doing a similar thing in my ongenome schema. I want to think more to be clear. Either way, not re-using phenotype (same phenotype having different id in another study) in multiple studies should not be a big deal.
Document issues with test loading the data in csv format.