legumeinfo / ArachisPheno

AraPheno source code for http://arapheno.1001genomes.org
MIT License
0 stars 0 forks source link

Loading test with csv file #5

Closed sdash-github closed 4 years ago

sdash-github commented 4 years ago

Document issues with test loading the data in csv format.

sdash-github commented 4 years ago

File for loading: file: phenotype_data_V1.sdWorkingCopy.csv
268258 Mar 30 13:48:43 2020 phenotype_data_V1.sdWorkingCopy.csv

sdash-github commented 4 years ago

Do the accession and replication_id have any external significance, or can I replace them with integers? (I think ArachisPheno expects integers.)

ADF: The accessions here are significant. You'll probably need to load their names into some table and then sub the integers it assigns them into a copy of the file. I think the replicate ids are just made up.

SG Right, I can make the accessions a table column corresponding to an arbitrary integer id. So far, I am having problems adding non-numerical values (experimenting with a smaller test file).

SD If that is an absolute requirement, we don't have a choice! But we would search and do our operations by our accession numbers rather than by their accession_id. I want to preserve both our accession numbers and our replication_id in the db. One choice is have another look up table to our accession-no and rep-id to their acc-id with foreign key relation defined. But before you do this I need to look at AraPheno and see how they are handling their operations involving the genotypes aka accessions. Also I am going to create a new issue and carry on this conversation rather than the emails. Sudhansu

svengato commented 4 years ago

Do our accessions have fields like latitude, longitude, country, species, collector, date (like the AraPheno examples)?

sdash-github commented 4 years ago

We should have the data at least for many of them. But it will take some time to look for them and compile them. We serve that data from our lis_germplasm module: https://peanutbase.org/germplasm/map
So if we have a list of accessions we should be able to pull out this passport data.

adf-ncgr commented 4 years ago

I guess I basically wrote the needed script when Carolina Chavarro was in touch earlier this year. presumably this will be an easier case with all accessions representing regular PI accession numbers and not a hodge-podge of different identifiers?

adf-ncgr commented 4 years ago

one other comment: it might be worth at least considering whether anything in the code actually depends on the primary key being an integer (it's convenient when the db is assigning them, but it's not clear to me at this point that it needs to be done this way). I am personally not a big fan of URLs like: https://arapheno.1001genomes.org/accession/6932/ which make it harder to link in from external systems that know it as "Ler-1" that said, it's probably not worth any major heroics, esp. as one could always do some sort of redirection magic for https://arapheno.1001genomes.org/accession/Ler-1 to work as desired

svengato commented 4 years ago

It looks like we have to add accessions before phenotypes anyway, otherwise uploading a phenotype throws an error like "Unknown accession with ID: 42". We could therefore:

  1. Add an explicit character-type primary key to the Accession model
  2. Export the accessions from PeanutBase, keep the relevant columns, and load them into the ArachisPheno database
  3. Then upload the phenotypes. (Possibly more steps than that)
sdash-github commented 4 years ago

Sven, Is it possible for you to generate relations diagram of the tables in the db, a schema diagram.  This will help me see things a little clearly. Also, where in the dev-arachispheno the sqlite file is located?

svengato commented 4 years ago

It is in ~/ArachisPheno/arapheno/db.sqlite3 Does a tool for generating such a diagram already exist?

sdash-github commented 4 years ago

I had assumed you might know some tools like schema-scrawler, etc. http://www.tobias-weis.de/short-schema-diagram-from-an-existing-sqlite-database/ https://www.schemacrawler.com/index.html https://github.com/schemacrawler/SchemaCrawler

But if it takes too much time, efforts to get it installed somewhere and get the entity diagrams, then you may decide for or against it. I will eventually look at the sqlite db and see for myself how they have organized the phenotype info in AraPheno.

svengato commented 4 years ago

I printed out the schema and will send it to you.

sdash-github commented 4 years ago

db.pdf Attaching Sven generated schema diagram (above).

svengato commented 4 years ago

There appear to be 2441 accessions in the PeanutBase lis_germplasm dataset, though not all of them use a "PInnnnnn" identifier. Andrew, you said there is a script for extracting those?

sdash-github commented 4 years ago

I am in the middle of preparing the passport data for our acc.

sdash-github commented 4 years ago

I have kept this file, phenoDataV1_passportdata.txt, in dev-arachispheno  in my home dir. I think you can access it, at least with sudo, else let me know.

This hasn't been formatted for the fields in the db table. Please let me know if you want that way and I will start formatting with those header names.

File stats : Total rows including replicated lines = 1007 Unique acc = 794 Acc with replications = 107 (uniq -d) # all have 3 reps except one acc with 2 reps (320 rows) Acc without replication = 687 (uniq -u)

svengato commented 4 years ago

Got it, thanks. Looking at it now -

sdash-github commented 4 years ago

FYI. The passport data file is ';' separates as some text might have coma within.

svengato commented 4 years ago

taxon;genus;species can be merged into species (code 1 for hypogaea, 2 for duranensis, 3 for ipaensis if any). accenumb -> name (also, it appears twice - maybe one is accession_id ?) origcty -> country collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?) latitude;longitude;elevation -> not necessary latdec;longdec -> latitude;longitude geographic_coord -> not necessary (unless we can use it to determine latitude and longitude for those whose values are 0?)

If there is no collection_date, collector, cs_number (= stock center number), those will be blank (which is okay).

svengato commented 4 years ago

Also, is this set of accessions considered a "genotype"? (for ArachisPheno purposes) If so, is it of type SNP chip, Full sequence, Imputed full sequence, xor RNA sequence?

sdash-github commented 4 years ago

The accessions are more actually genetic lines. We do not have genotype data yet. I would not worry about filling anything related to genotype data in the db now.

On 2020/3/31 1:13 PM, svengato wrote:

Also, is this set of accessions considered a "genotype"? (for ArachisPheno purposes) If so, is it of type SNP chip, Full sequence, Imputed full sequence, xor RNA sequence?

adf-ncgr commented 4 years ago

as far as I can tell, arapheno's concept of genotype is simply whether or not a given accession has data (somewhere other than arapheno) with respect to a given platform. See the filter at the top of https://arapheno.1001genomes.org/accessions/ with the dropdown of choices.

we could probably ignore it or perhaps repurpose it slightly to indicate membership of the accession in some designated set (e.g. "core" or "minicore"). At the end of the day, it seems to do nothing more than provide a convenient subsetting mechanism (unless I'm missing something)

svengato commented 4 years ago

Yes, we can ignore it for now.

sdash-github commented 4 years ago

On 2020/3/31 1:08 PM, svengato wrote:

taxon;genus;species can be merged into species (code 1 for hypogaea, 2 for duranensis, 3 for ipaensis if any).

I think you mean the 'phenotype_species id'.

accenumb -> name (also, it appears twice - maybe one is accession_id ?) origcty -> country collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?)

But this field doesn't have the Lat/Lon values ? they are the next two fields in the data file.

latitude;longitude;elevation -> not necessary latdec;longdec -> latitude;longitude geographic_coord -> not necessary (unless we can use it to determine latitude and longitude for those whose values are 0?)

We may not use this for now.

If there is no collection_date, collector, cs_number (= stock center number), those will be blank (which is okay).

We may not use this for now.

svengato commented 4 years ago

I think you mean the 'phenotype_species id'.

Correct.

collsite -> sitename (and can I eliminate the "Lat/Lon accurate to ..."?)

But this field doesn't have the Lat/Lon values ? they are the next two fields in the data file.

Right, I just want to shorten the string.

svengato commented 4 years ago

I spent some time experimenting with the REST API to confirm that it uses the [accession / phenotype / study] id rather than name, and replaced references to arapheno.1001genomes.org with our URL (in a file hidden from GitHub), so that the REST API will work properly. (Still to do: how to embed it in the HTML files that describe the REST API.) I also wrote a script to recreate a pristine database (with no accessions, phenotypes, or studies) if necessary.

Now I am reviewing all the details before converting the accession id to a string, after which I can rebuild the database and try to load the accessions file (or a converted version of it). The main questions are

  1. If the accession id is a string, will the phenotype-uploading process be able to handle it?
  2. Will we need to display or use the replicate id from the phenotypes file? I think AraPheno ignores that field.
sdash-github commented 4 years ago

Will we need to display or use the replicate id from the phenotypes file? I think AraPheno ignores that field.

We most likely won't use the rep_id for querying; but when we query in a way that the data is reported for each of the 3 reps for some accessions, the rep id should be preserved, so that we know which of the reps certain data refers to in case there is a question and should be traceable to the original data file.  That is my purpose of preserving the rep-id in the db.

svengato commented 4 years ago

I converted your accessions CSV file to a JSON file (there was one quoted field in a collection site that took a while to track down). It successfully imported in a test, using an integer id.

Current question: if the phenotype's accession id is a string, does the accession's own id have to be a string as well? (so they can match)

sdash-github commented 4 years ago

Current question: if the phenotype's accession id is a string, does the accession's own id have to be a string as well? (so they can match)

I am not able to wrap my head around "the phenotype's accession id" and the "accession's own id" .  I think @adf Andrew can catch it and respond.

Just saw the phenotypes coming up in the webpage. Congratulations! http://dev.lis.ncgr.org:50007/accessions/?page=1

adf-ncgr commented 4 years ago

I'm not sure I understood the question; if talking about a foreign key -> primary key relationship I think they'd have to match- probably a bad idea to do type-casting in that context, even if permitted.

sdash-github commented 4 years ago

Nice to see the collection site in the map. http://dev.lis.ncgr.org:50007/accession/774/

svengato commented 4 years ago

Just saw the phenotypes coming up in the webpage. Congratulations!

Those are the accessions, I still have to add the phenotypes. One thing I can do tonight is try to upload your phenotype CSV file, using integer ids, possibly without the non-numeric columns.

adf-ncgr commented 4 years ago

Nice to see the collection site in the map. http://dev.lis.ncgr.org:50007/accession/774/

I know I will regret making this comment, but: if judged solely by the street names given in that particular map example, one might well be led to believe that peanuts really are the musical fruit... ;)

sdash-github commented 4 years ago

Sitename: Village market, Mercado La Union, Trujillo

And nice little village market with bus services, a university and lots of suburbs !!

svengato commented 4 years ago

if judged solely by the street names given in that particular map example, one might well be led to believe that peanuts really are the musical fruit... ;)

The BBC could use the map to send Michael Palin on a series of Great Legume Journeys!

Chopin and Liszt comics - http://www.harkavagrant.com/index.php?id=302

svengato commented 4 years ago

Anyway, I tried to upload a modified version of the phenotype file, with the two non-numeric columns removed, integer ids matching those in the accessions file, "na" replaced with blanks, and '/' and '.' replaced with '_' in the column names. But it gives an error: "could not convert string to float: #DIV/0!"

svengato commented 4 years ago

There were some literal "#DIV/0!" in the file, so I replaced them with blanks and tried again. It accepted the submission, http://dev.lis.ncgr.org:50007/submission/ad6dc3b3-2c27-4603-976b-2269b33750cd/

svengato commented 4 years ago

It also appears to be trying to send me an e-mail message about the submission status, though it has not yet arrived. The sender (From field) is Ümit (from AraPheno), not me, so the e-mail server is probably incorrectly configured too.

svengato commented 4 years ago

I went to the admin page and set its status to Published, but without curating it (setting the ontology fields, etc). Now you can view the phenotypes under Public Phenotypes (and the study, under Public Studies).

Summary: the user uploads a phenotype table, then the admins/curators add relevant information and update its status. Once published, the submission becomes a Study.

Also confirmed that it is possible to go back and curate it (change status from Published to Curation).

adf-ncgr commented 4 years ago

looks like good progress for goober peas; a couple of things I noticed:

sdash-github commented 4 years ago

Inconsistency in the number of accessions: This file (study) has Unique acc = 794. The page with this link http://dev.lis.ncgr.org:50007/phenotype/5/ shows 'Geographic distribution of 876 accessions' above the world map. But the table view in the same page has correct number of accessions: 793. Just for records for now.  This is probably not the time to solve all these problems.

svengato commented 4 years ago

link to AraGWAS Associations:AraGWAS:3 probably needs to be suppressed.

Removed this. Also noticed that DOI citations are incorrect, the DataCite link is still set to what it was for AraPheno.

svengato commented 4 years ago

876 includes all replicates with a non-blank value for that phenotype.

sdash-github commented 4 years ago

At some point in the near future we should change the code for this figure to reflect the number of accessions, rather than the count based on the number of replicates. In the map context it doesn't make much sense.

On 2020/4/1 9:58 AM, svengato wrote:

876 includes all replicates with a non-blank value for that phenotype.

svengato commented 4 years ago

I am personally not a big fan of URLs like: https://arapheno.1001genomes.org/accession/6932/

I am getting close to testing the change in accession id from integer to character. Just to confirm, this is only for the accession (and replicate) ids? Not phenotypes or studies, which may have a longer name?

sdash-github commented 4 years ago

Yes, URLs like https://arapheno.1001genomes.org/accession/IPnnnnn/ would very user friendly. Accessions and rep-ids (preferably replicate name like IP12345_2) would be great.

Now that you are giving me to choose, it will be nice to have a phenotype name as a URL parameter like: https://arapheno..........org/phenotype/?study=2&phenotype=leaf_width I am not worried about study name now.

Thanks for bringing these options up.

svengato commented 4 years ago

Let me think about it - what if two studies have a phenotype with the same name? There might be a conflict when you submit the second study.

sdash-github commented 4 years ago

if two studies have a phenotype with the same name? It is more likely this would be the case. For example leaf_width will be likely the same name in another study. The combination of the particular study and the phenotype would make it unique.

svengato commented 4 years ago

I looked in AraPheno and found an example: two studies have a phenotype called As75, but they have different ids and are treated separately. Other examples: Ca43, Co59, Cu65.

sdash-github commented 4 years ago

found an example: two studies have a phenotype called As75, but they have different ids Superficially, without having seen the details in the schema, it sound like not a well normalized schema, at least this part.

sdash-github commented 4 years ago

I want to re-think what I just said about reuse of penotype-id in multiple studies. We might already be doing a similar thing in my ongenome schema. I want to think more to be clear. Either way, not re-using phenotype (same phenotype having different id in another study) in multiple studies should not be a big deal.