nwfsc-fram / research-catch-db

0 stars 2 forks source link

Compare CouchDB Taxonomy Data with research-catch-db Taxonomy Data #37

Closed jimfellows-NOAA closed 4 years ago

jimfellows-NOAA commented 4 years ago

@BHHorness-NOAA @SaOgaz-NOAA

Looking into flattening out taxonomy data from CouchDB and comparing it to the data in the research-catch-db taxonomy data in Postgres. The main goal here is to use the data in couch as the gold standard and sync Postgres using that data.

Steps:

Trying to assign this to myself, but I don't seem to have that option...

jimfellows-NOAA commented 4 years ago

I've flattened the couch data and have joined it to the PostgreSQL research_catch.species_lu table. That solves the reporting issue, I have it saved to a Zeppelin Notebook on my machine. We should be able to update directly from CouchDB, but I don't think Drill can write to RDBMS. Looking into options...

jimfellows-NOAA commented 4 years ago

Don't think using Zeppelin/Drill to make the actual update is the best choice, but it could definitely be automated using Python or something else if we want to sync these tables on a regular basis.

Out of the 161 records in SPECIES_LU, there are a 17 common names that weren't an exact match in couch, but the spreadsheet has the closest string match available, and most look pretty close. Couch also has two taxonomy records for Rock Sole, one at the Genus level and one at the species level, I'm guessing we'll want both genus and species inserted to the Postgres scientific name, but that's another decision that has to be made.

@SaOgaz-NOAA let me know if you want to talk about next steps for making the initial update to the data. I see 1604 distinct species-level scientific names in Couch, which is a lot more than those in Postgres. Should all of these eventually make it to research_catch?

I've dropped results of the comparison between couch and postgres here: \nwcfile\FRAM\Users\James.Fellows\pg_to_couch_taxonomies\join_on_common_name.xlsx

SaOgaz-NOAA commented 4 years ago

@jimfellows-NOAA thanks! We should definitely tag up to talk, but it might not be until further into next week. I'll let you know when I'm ready.

SaOgaz-NOAA commented 4 years ago

@jimfellows-NOAA so I'm following up with Kate and Kayleigh about the almost or completely unmatched ones.

To answer your question above, we definitely don't want all the couch names, just the matches for the common names currently in the postgres DB.

Did the list I gave you include "starry flounder"? I forget which version of the spreadsheet I gave you and I don't see that one on your excel sheet.

jimfellows-NOAA commented 4 years ago

Added in Starry Flounder to that shared sheet in my drive folder, must've not been in the table when I ran things the first time. Also made suggestions for the couple mismatches that didn't really make sense with the closest string match, we'll see what Kate and Kayleigh have to say.

jimfellows-NOAA commented 4 years ago

itis_tsn field has been added to SPECIES_LU in DEV, and that field plus scientific_name have been updated with the corresponding values from Couch taxonomies, matched based on the common name.

Per the observer analysts, I've left the following records with null itis_tsn and scientific_name values:

SaOgaz-NOAA commented 4 years ago

Thanks for much for all the help on this @jimfellows-NOAA! Closing this ticket.