lizzieinvancouver / egret


usda data scraping and cleaning process #36

Open FrederikBaumgarten opened 2 weeks ago

FrederikBaumgarten commented 2 weeks ago

@dbuona @lizzieinvancouver @DeirdreLoughnan here is what Justin has done so far, in his own words:

Workflow:

All cleaning scripts can be found in "scrapeUSDAseedmanual/cleaning" and all intermediate data files can be found in "scrapeUSDAseedmanual/output/earlyIterationDataSheets"

Ideally I would have done this all in one script, but along the way we ran into a couple of issues with data cleaning that Selena was already working on. We felt it made more sense to build new scripts on top of the older ones so that we wouldn't be tampering with changes made by hand in Excel.

I forgot to mention that "germinationCleaningFinal.R" gave the output "USDAGerminationCleanedFinal.csv" which was what I used to make the "usdaGerminationMaster.xlsx" file. There's another file in the earlyIterationDataSheets called "usdaGerminationJINJJA.csv" which I made as a backup because my RStudio was bugging out on me during the EGRET correction script writing.

dbuona commented 2 weeks ago

@buniwuuu @ngoj1 I've started to combine all of the cleaning scripts into a master source file called clearnmerge_all_usda.R; however, I can't get the 3 cleaning scripts to run sequentially. Can you work on getting it to run? Ideally, it would also not write out intermediate xlsx files between scripts. Let me know if you have any questions.
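
One possible way to chain the scripts without intermediate files is to source them into a shared environment, so that each script picks up the data frame left by the previous one. This is only a minimal sketch: it assumes each script works on an object named `d`, and the three script names and bodies below are hypothetical stand-ins written to a temporary folder so the example runs on its own.

```r
# Hypothetical stand-ins for the three cleaning scripts. Each one is assumed
# to read or modify a data frame called `d` (not the real egret scripts).
tmp <- tempfile("cleaning")
dir.create(tmp)
writeLines(
  'd <- data.frame(species = c(" Acer rubrum", "Betula lenta "),
                   chill.dur = c("30", "60"), stringsAsFactors = FALSE)',
  file.path(tmp, "01_read.R")
)
writeLines('d$species <- trimws(d$species)', file.path(tmp, "02_cleanNames.R"))
writeLines('d$chill.dur <- as.numeric(d$chill.dur)', file.path(tmp, "03_cleanChill.R"))

# Run the scripts in order inside one environment, so the data stay in memory
# and nothing has to be written to xlsx between steps.
cleaningEnv <- new.env()
scripts <- sort(list.files(tmp, pattern = "\\.R$", full.names = TRUE))
for (s in scripts) {
  source(s, local = cleaningEnv)
}

# Pull the final data frame out of the shared environment
usda <- get("d", envir = cleaningEnv)
str(usda)
```

If one script fails, `source()` stops at that script, which makes it easier to see which step breaks the sequence.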

lizzieinvancouver commented 2 weeks ago

@FrederikBaumgarten possible helpful pseudocode:

# Assumes the cleaned USDA data are already loaded as a data frame called usda
# Species with a non-missing value in each column
chilldurminnoNA <- usda$species[!is.na(usda$chill.dur.min)]
chilldurmaxnoNA <- usda$species[!is.na(usda$chill.dur.max)]
respvarminnoNA <- usda$species[!is.na(usda$responsevarmin)]
respvarmaxnoNA <- usda$species[!is.na(usda$responsevarmax)]

# Species that have both a min and a max somewhere in the data
sppwithminmaxchill <- chilldurminnoNA[chilldurminnoNA %in% chilldurmaxnoNA]
sppwithminmaxresp <- respvarminnoNA[respvarminnoNA %in% respvarmaxnoNA]
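
A toy example may help show what this check returns. The data below are made up, and only the column names come from the pseudocode above. The sketch also adds a row-level variant (min and max filled in the same row), which is an assumption about what might be wanted and not part of the original suggestion.

```r
# Toy data with made-up values; column names follow the pseudocode above.
usda <- data.frame(
  species = c("A", "A", "B", "C"),
  chill.dur.min = c(30, NA, 60, NA),
  chill.dur.max = c(NA, 90, 120, NA),
  stringsAsFactors = FALSE
)

# Species-level check, as in the pseudocode: the species has a min in at
# least one row and a max in at least one row (not necessarily the same row).
sppMin <- usda$species[!is.na(usda$chill.dur.min)]
sppMax <- usda$species[!is.na(usda$chill.dur.max)]
sppAnyRow <- unique(sppMin[sppMin %in% sppMax])

# Row-level check (added assumption): min and max are both filled in one row.
sameRow <- !is.na(usda$chill.dur.min) & !is.na(usda$chill.dur.max)
sppSameRow <- unique(usda$species[sameRow])

print(sppAnyRow)
print(sppSameRow)
```

Here species A passes the species-level check but not the row-level one, because its min and max sit in different rows.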
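
dbuona commented 2 weeks ago

Just to summarize, here is what needs to happen in this code:

  • Get all the cleaning files to run without externally manipulating them in Excel
  • In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing and should both be considered chilling)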

ngoj1 commented 2 weeks ago

Hello,

Sorry for the late reply! I just finished up taking grad photos with my family and Britany and I will be doing shoot elongation until 2pm, but after that I will try troubleshooting the code and converting the parts that were edited manually in excel into R code.

Best regards, Justin

On Wed., Jul. 10, 2024, 10:15 a.m. dbuona wrote:

Just to summarize here is what needs to happen on this code:

  • Get all the cleaning files to run without externally manipulating them in excel
  • In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing and should both be considered chilling)


ngoj1 commented 2 weeks ago

Get all the cleaning files to run without externally manipulating them in excel

@dbuona I pushed a new script called "cleanmerge_all_usda_JNVER" (Justin version, since I wanted to keep your original script as a backup). Instead of using source(), I combined all the code from the three scripts and then tracked down most of the changes we had made manually in Excel by combing through the columns. I'm hoping there aren't any weird values left, but it's possible I missed a few. In any case, Selena would have addressed these weird values in Issue #20, and if you ever encounter them and need me to fix them, I can do that in the new script I've just pushed.
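
Manual Excel corrections like these can be recorded in the script as an explicit lookup, so that every change is visible and repeatable. The sketch below uses made-up values and a hypothetical column name (`chill.dur`); the actual odd values are whatever turns up in the USDA data, and this is not necessarily how the JNVER script handles them.

```r
# Made-up example data; the column name and its values are hypothetical.
d <- data.frame(
  species = c("A", "B", "C", "D"),
  chill.dur = c("30", "60 days", "ca. 90", "120"),
  stringsAsFactors = FALSE
)

# Each manual fix becomes a named entry: odd value = corrected value.
fixes <- c("60 days" = "60", "ca. 90" = "90")

# Replace only the entries that match a known odd value
toFix <- d$chill.dur %in% names(fixes)
d$chill.dur[toFix] <- unname(fixes[d$chill.dur[toFix]])

# Convert to numeric; anything that is NA afterwards (including values that
# were already missing) is listed for a closer look.
d$chill.dur <- suppressWarnings(as.numeric(d$chill.dur))
leftover <- d$species[is.na(d$chill.dur)]
print(leftover)
```

Keeping the fixes in one named vector means a newly found odd value only needs one extra entry.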

In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing and should both be considered chilling)

My laptop is about to run out of battery as I forgot to bring my charger but I can address this when I get home later tonight!

If there are any warnings that pop up in the code please let me know and I will backtrack and figure it out.

ngoj1 commented 1 week ago

In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing and should both be considered chilling)

Sorry this took so long! In the "cleanmerge_all_usda_JNVER" script I added a section at the very bottom that copies all of the cold.strat.dur.XXX (Avg, Min, and Max) column data into new columns called chill.dur.XXX.comb (for "combined"), and I ran some tests to make sure that no NAs were being introduced and no data were being overwritten. I put this in new columns so that we still have the original chill.dur.XXX columns from before the merge in case we need them separated.
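
For anyone reviewing that section, a combine step with checks might look roughly like the sketch below. The data are made up, only the Avg column is shown, and the rule used here (keep the chilling value where present, otherwise take the cold stratification value) is an assumption, not necessarily what the JNVER script does.

```r
# Made-up data; names follow the chill.dur.XXX / cold.strat.dur.XXX pattern
# described above (exact capitalization in the real data may differ).
d <- data.frame(
  chill.dur.Avg = c(30, NA, NA, 45),
  cold.strat.dur.Avg = c(NA, 60, NA, 45)
)

# Rows where both columns hold different values need a decision by hand,
# so stop if any exist.
conflict <- !is.na(d$chill.dur.Avg) & !is.na(d$cold.strat.dur.Avg) &
  d$chill.dur.Avg != d$cold.strat.dur.Avg
stopifnot(!any(conflict))

# Keep the chilling value where present, otherwise use cold stratification.
d$chill.dur.Avg.comb <- ifelse(is.na(d$chill.dur.Avg),
                               d$cold.strat.dur.Avg,
                               d$chill.dur.Avg)

# Check 1: no original chilling value was overwritten.
keptChill <- !is.na(d$chill.dur.Avg)
stopifnot(all(d$chill.dur.Avg.comb[keptChill] == d$chill.dur.Avg[keptChill]))

# Check 2: the combined column is NA only where both sources were NA.
stopifnot(sum(is.na(d$chill.dur.Avg.comb)) ==
            sum(is.na(d$chill.dur.Avg) & is.na(d$cold.strat.dur.Avg)))
```

The same steps could be wrapped in a loop over Avg, Min, and Max. Writing to a separate .comb column, as described above, leaves the original columns untouched.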

lizzieinvancouver commented 1 week ago

@dbuona Could you take a look at this and get us down to one functional script? It should be called cleanAllUsda.R ... and please delete all the other scripts and extraneous files.