mbari-org / SeafloorMappingDB

Make MBARI seafloor mapping datasets more accessible and useful
GNU General Public License v3.0
3 stars 6 forks source link

Write script to generate yearly comment .csv files, add load step to load data from .xlsx files derived from them #206

Open MBARIMike opened 2 months ago

MBARIMike commented 2 months ago

Pulled from an email message...

On May 2, 2024, at 9:38 AM, Jenny Paduan paje@mbari.org wrote:

Hi Mike and Karen, Do you think a file like this (attached, and in: /Volumes/SeafloorMapping/2023/SMDB/SMDB_2023_survey_tally.xlsx ) would serve as the method for SMDB to determine whether a survey was successful/failed/test/etc., whether there was a patch test, and which vehicle was used? I think it's what we decided was the best way to handle this, and it's become part of my workflow as I've been keeping versions of such a tally like this to maintain my sanity during each expedition. Please advise if you like the chosen vocabulary or not, or if there'd be a better way to capture this info.

Note it's in a directory SMDB, which could appear in each of the year directories with a similar "SMDB_year_survey_tally.xlsx" file included for the SMDB load to interact with. Other things you want to have available for the SMDB load could go in these new directories (don't know what those would be yet, just making the suggestion).

For prior years, since it didn't used to be part of anyone's workflow, for the older surveys I can use the spreadsheet Mike made for the trackline length project ITD asked for a couple years ago to create an annual tally like this. Then going forward, can "km_trackline" to be added to it, or some "update" file created periodically with the data in it to transcribe into this Excel file?

Another thing to be filled in later is to supply the MGDS compilation name as data get submitted to that repository. The URL at MGDS for the 2023 Mid-Atlantic Ridge surveys isn't available yet to include in the above spreadsheet, so this stands as an example of how the table would have to evolve and be re-read by the SMDB load.

MBARIMike commented 2 months ago

I'm starting on a survey_tally.py module that will read data from the .xlsx files in the YYYY/SMDB directories and update the database fields during the nightly loads. The module will also have a method to write .csv files in those directories reflecting the content of the database. In order to cause an update of the database field values the .csv can be imported into the .xlsx where edits can be made.

MBARIMike commented 2 months ago

Examining the database structure I wonder if it might be time to revisit whether we have all the fields we want. For example, I see a LASS checkbox column in the spreadsheet, but that field is not in Mission. Mission is also missing a Route field. Maybe we should add those, and maybe more?

MBARIMike commented 2 months ago

This latest PR adds columns missing in Mission but present in the .xlsx files. (Some names have been adjusted.)

MBARIMike commented 2 months ago

The last PR enables loading from .xlsx and writing to .csv files. Testing on dev machine on db with 3 Missions:

➜  /app git:(main) smdb/scripts/load.py --spreadsheets -v
INFO 2024-05-15 20:00:15,929 survey_tally.py process_xlsx():165 Processing 2022
INFO 2024-05-15 20:00:15,929 survey_tally.py read_xlsx_into_df():130 Reading /mbari/SeafloorMapping/2022/SMDB/SMDB_2022_survey_tally.xlsx
INFO 2024-05-15 20:00:16,197 survey_tally.py update_db_from_df():159 Updated 2022/20220207m1
INFO 2024-05-15 20:00:16,200 survey_tally.py update_db_from_df():159 Updated 2022/20220207m2
INFO 2024-05-15 20:00:16,203 survey_tally.py update_db_from_df():159 Updated 2022/20220206m2
INFO 2024-05-15 20:00:16,351 survey_tally.py process_csv():243 Writing /mbari/SeafloorMapping/2022/SMDB/SMDB_2022_survey_tally.csv
INFO 2024-05-15 20:00:19,065 load.py save_logger_output():264 Saving to local log file: /etc/smdb/load.txt
INFO 2024-05-15 20:00:19,067 load.py save_logger_output():265 Saving to media log file: logs/load.txt
INFO 2024-05-15 20:00:19,067 load.py save_logger_output():266 Elapsed time: 0:00:21.413516
MBARIMike commented 1 month ago

@jbpaduan The changes in the PR are now rolled out into production with nightly checks for updated .xlsx files and creation of all *survey_tally.csv files.

jbpaduan commented 1 month ago

The column headings are changed to (see SMDB_2024_survey_tally.xlsx): Mission,Route,Location,Vehicle,Status_a,Status_b,Status_c*,Comment,Trackline_km,MGDS_compilation The asterisks refer to the footer of the table, where vocabulary is given. That vocabulary has also been modified.

In the Vehicle column, I expect to use MAUV1, MAUV2, LASS, Sentry, and more with time (note the separate columns for AUV and LASS are gone).

For datalists, "production survey" should be assigned a 100; "use_with_caution" should be assigned 1.0; "repeat_survey" should be assigned 1000; test_survey should be assigned 0.001. The user will need to change these as actually appropriate, this is intended to make them aware they may need to. Example for a production_survey: /2022/20220902m1/datalist.mb-1 -1 100

The "MGDS_compilation" URL prefix should be (for now): https://www.marine-geo.org/tools/search/entry.php?id=

jbpaduan commented 1 month ago

Looks like 2023's CSV file is missing all of the PacNW expedition. In the 2022 CSV, I see that I need to add quite a few directories to the exclude list (e.g., Arctic/ARA13C/MBARI/MBARI_data/AUV_data/* and CHRT_MBSystem/SampleData/; I don't want to delete the extra directories because Dave wants to preserve everything.)

MBARIMike commented 1 month ago

Regarding https://github.com/mbari-org/SeafloorMappingDB/issues/206#issuecomment-2125706425, if a Mission can have multiple Status* values then I can model this as a many-to-many in the database and the field in the spreadsheet can space delimit those values. I think this is a cleaner way to do this.

jbpaduan commented 1 month ago

OK, the 2024 tally is updated with what I think you are looking for.

MBARIMike commented 1 month ago

The above changes have been merged into production. Here's a URL that shows some of the entries for the 2024 tally file: https://smdb.shore.mbari.org/missions/?name=2024%2F202405&region_name=&patch_test=&repeat_survey=&mgds_compilation=&expedition__name=&submit=Filter&sort=-region_name

and what I see in my browser:

Screenshot 2024-05-24 at 15 57 19
jbpaduan commented 1 month ago

This is working great on the Missions page! Will it break if I add a mission in the spreadsheet that should be on record as a failed survey but for which the data were never processed? (I don't expect it to be found by the load script, since there's no ZTopo.grd, but I don't want the load to break if I've made a place-holder in the spreadsheet; MAOps2016 is an example.)

MBARIMike commented 1 month ago

It will not break the load if there is a mission in the spreadsheet that's not in the database. Instead, a WARNING message will be written to the log that looks something like:

WARNING 2024-06-02 19:56:20,371 load.py update_db_from_df():1646 Not found in database: 2023/20230830m1