cleaning phen and seed data from USDA

lizzieinvancouver commented 2 weeks ago

Copied from issue #36 by @selenashew ...

The phen & seed data have been cleaned and can be found in the output folder as "cleaned_phen_data_final.csv" and "cleaned_seed_data_final.csv". The cleaning script used is found in the cleaning folder as "phen_seed_data_final_cleaning_script.R".

I am making a new issue for this just to make organizing easier ... @wangxm-forest Can you check the code and output related to this when you are back and have time? Thanks.

selenashew commented 2 weeks ago

Hi everyone,

As mentioned during the previous in-person lab meeting, I realized that there were more issues with the seed data & phen data and went back to fix my code to address all of the issues.

Issues included (among many):

weird symbols (trailing -, Â§, §, |, etc., had to go back to the pdf and figure out what the actual values should be)
duplicate columns (different column names that actually contain the same info; this happened repeatedly due to lack of column name standardization across the various authors)
incorrectly transcribed column names/values
values that got converted into dates from the initial scraping
other misc. weird issues
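
For anyone reviewing the output, here is a minimal sketch of how cells affected by the first and fourth issues above could be flagged for a manual check against the PDF. It is not the code in the cleaning script (which is in R); it is written in Python only for illustration. The symbol set, the date pattern, and all function names are assumptions made for this example.

```python
import re

# Hypothetical set of stray symbols, taken from the examples listed above.
STRAY_SYMBOLS = re.compile(r"[Â§|]")

# Assumption: a range such as "3-5" that was turned into a date during
# scraping shows up as something like "5-Mar" or "Mar-05".
DATE_LIKE = re.compile(r"^(\d{1,2}-[A-Za-z]{3}|[A-Za-z]{3}-\d{1,2})$")


def strip_stray_symbols(value: str) -> str:
    """Remove stray symbols and any trailing hyphen from one cell."""
    cleaned = STRAY_SYMBOLS.sub("", value).strip()
    return cleaned.rstrip("-").strip()


def looks_like_date(value: str) -> bool:
    """True if the cell looks like a value that was converted to a date."""
    return bool(DATE_LIKE.match(value.strip()))


def flag_cells(rows, columns):
    """Return (row_index, column, value) for cells needing a manual check.

    `rows` is a list of dicts, e.g. as produced by csv.DictReader.
    """
    flagged = []
    for i, row in enumerate(rows):
        for col in columns:
            value = row.get(col, "")
            changed = strip_stray_symbols(value) != value.strip()
            if changed or looks_like_date(value):
                flagged.append((i, col, value))
    return flagged
```

Flagging rather than silently stripping matters here, because the correct value often had to be looked up in the PDF and cannot be recovered from the cell alone.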

I believe that I have now officially officially cleaned both datasets (fingers crossed!). The column headers are also now in camelCase. Please look through the files below:

Cleaning script: "phen_seed_data_final_cleaning_script.R" (in the cleaning folder)
Cleaned phen data: "cleaned_phen_data_final.csv" (output folder)
Cleaned seed data: "cleaned_seed_data_final.csv" (output folder)

lizzieinvancouver commented 2 weeks ago

@selenashew Thank you! The R scripts and any CSV files should also be camelCase. Can you update to follow this?

selenashew commented 2 weeks ago

The R scripts and CSV files have been renamed to camelCase! They are now called phenCleanedFinalMaster.csv, seedCleanedFinalMaster.csv, and finalPhenSeedCleaningScript.R. I have also renamed the other scripts we made throughout the entire process to camelCase.

lizzieinvancouver commented 2 weeks ago

@selenashew Thank you! @wangxm-forest is just back so hopefully in the coming weeks she can check out the data and make sure it all looks good and we can all follow what you did.

selenashew commented 2 weeks ago

Hi @wangxm-forest, I've created a data dictionary to make it easier to see what the units & definitions of the different columns for the seed & phen datasets are! It can be found here: analyses/scrapeUSDAseedmanual/output/usdaPhenSeedDataDictionary.xlsx

wangxm-forest commented 1 week ago

@lizzieinvancouver Sorry that I wasn't able to work on this earlier. I will try to check out the data by the end of this week! Thank you @selenashew for working on this!

wangxm-forest commented 1 week ago

@lizzieinvancouver I took a look at the data sheets and am happy to see that there is so much useful information! However, I have a few questions and comments:

In the phenCleanedFinalMaster.csv, the "locationElevation" column contains a mix of numerical elevations and location names. It might be more useful to separate these into different columns for clarity.
In the same file, there are three columns ("seedLength," "cleanedSeedWtKg," and "cleanedSeedWtKg") that do not seem to be phenological traits (or are they?). I think it makes more sense to move them into seedCleanedFinalMaster.csv.
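
To make the first suggestion concrete, here is a rough sketch of how the mixed "locationElevation" values could be split. It assumes each cell holds either a number (or a numeric range) or a place name, never both. The output names "elevation" and "location" are hypothetical, and the sketch is in Python for illustration only, not taken from the lab's R scripts.

```python
import re

# Matches a plain number or a numeric range, e.g. "1500", "1,200", "300-900".
# Assumption: anything else in the column is a location name.
NUMERIC_OR_RANGE = re.compile(
    r"^\d[\d,]*(\.\d+)?(\s*-\s*\d[\d,]*(\.\d+)?)?$"
)


def split_location_elevation(value: str) -> dict:
    """Route one "locationElevation" cell into one of two new columns.

    The keys "elevation" and "location" are hypothetical column names.
    """
    text = value.strip()
    if NUMERIC_OR_RANGE.match(text):
        return {"elevation": text, "location": ""}
    return {"elevation": "", "location": text}
```

The elevation is kept as text rather than converted to a number, so ranges survive and nothing is assumed about units.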

Since Selena is no longer working in the lab, I’m happy to make the changes if you think they are necessary.

lizzieinvancouver commented 5 days ago

@wangxm-forest Thanks!

Okay if you want to switch for now, but it seems like we are not sure if we will use those data yet so I would wait to do any more work until you need the data.
Hmm, that sounds okay on quick glance, but I am not sure. I would check some of the tables in the PDF first and see how they compare to where the data in seedCleanedFinalMaster.csv is presented. If the phenological data is presented WITH the seed weight data, should we break it up? You should check how authors refer to the tables in a couple of chapters too. Let me know what you find!

@selenashew is happy to keep helping the lab on a volunteer basis so if you need her help for anything I think it is okay to ask (and she can decline if too busy or such).

lizzieinvancouver / egret

cleaning phen and seed data from USDA #68