Open FrederikBaumgarten opened 2 weeks ago
@buniwuuu @ngoj1 I've started to combine all of the cleaning scripts into a master source file called clearnmerge_all_usda.R; however, I can't get the 3 cleaning scripts to run sequentially. Can you work on getting it to run? Ideally, it would also not write out intermediate xlsx files between scripts. Let me know if you have any questions.
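One possible way to run the scripts in sequence without intermediate files is to have each script modify a shared data frame in memory and call `source()` on them in order. This is a minimal sketch, not the actual master script: the object name `d`, the three script file names, and the toy values are all hypothetical stand-ins, written to a temp directory only so the example is self-contained.

```r
# Hypothetical stand-ins for the three cleaning scripts, written to a temp
# directory so the example runs on its own. The real scripts would replace these.
tmp <- tempdir()
scripts <- file.path(tmp, c("step1.R", "step2.R", "step3.R"))
writeLines('d$species <- trimws(d$species)', scripts[1])
writeLines('d$chill.dur[d$chill.dur %in% c("", "N/A")] <- NA', scripts[2])
writeLines('d$chill.dur <- as.numeric(d$chill.dur)', scripts[3])

# Toy input (made-up values, not USDA data)
d <- data.frame(species = c(" Acer rubrum", "Abies alba "),
                chill.dur = c("30", "N/A"),
                stringsAsFactors = FALSE)

# Run the scripts in order in the calling environment, so each one sees the
# object `d` left by the previous one; nothing is written to disk in between.
for (s in scripts) {
  source(s, local = TRUE)
}
```

The design choice here is that the data frame itself is the hand-off between scripts, so no script needs to read or write an xlsx or csv file until the final output.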
@FrederikBaumgarten possibly helpful pseudocode:
# Species with a non-NA minimum / maximum chilling duration
chilldurminnoNA <- usda$species[!is.na(usda$chill.dur.min)]
chilldurmaxnoNA <- usda$species[!is.na(usda$chill.dur.max)]
# Species with a non-NA minimum / maximum response variable
respvarminnoNA <- usda$species[!is.na(usda$responsevarmin)]
respvarmaxnoNA <- usda$species[!is.na(usda$responsevarmax)]
# Species that have both a minimum and a maximum
sppwithminmaxchill <- chilldurminnoNA[chilldurminnoNA %in% chilldurmaxnoNA]
sppwithminmaxresp <- respvarminnoNA[respvarminnoNA %in% respvarmaxnoNA]
Just to summarize, here is what needs to happen with this code:
Hello,
Sorry for the late reply! I just finished taking grad photos with my family, and Britany and I will be doing shoot elongation until 2pm. After that I will try troubleshooting the code and converting the parts that were edited manually in Excel into R code.
Best regards, Justin
On Wed., Jul. 10, 2024, 10:15 a.m. dbuona @.***> wrote:
Just to summarize, here is what needs to happen with this code:
- Get all the cleaning files to run without externally manipulating them in Excel
- In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing, and both should be considered chilling)
Get all the cleaning files to run without externally manipulating them in Excel
@dbuona I pushed a new script called "cleanmerge_all_usda_JNVER" (Justin version, since I wanted to keep your original script as a backup). Instead of using source(), I combined all the code from the three scripts into one, and then recovered the majority of the changes we made manually in Excel by combing through the columns carefully. I'm hoping there aren't any weird values left, but it's possible I missed a few. In any case, Selena would have addressed these weird values in Issue #20, and if you ever encounter them and need me to fix them, I can do that in the new script I've just pushed.
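One way to keep manual Excel fixes reproducible is to record each one as a row in a small lookup table and apply the table in R. The sketch below is only an illustration under stated assumptions: the `fixes` table, the column names, and all values are made up and are not taken from the actual script or the USDA data.

```r
# Toy data with made-up values; column names are hypothetical
usda <- data.frame(species = c("Acer rubrum", "Abies alba", "Betula lenta"),
                   chill.dur.min = c("30", "6o", "about 90"),
                   stringsAsFactors = FALSE)

# Each row records one manual fix: which column, the bad value, its replacement
fixes <- data.frame(column = c("chill.dur.min", "chill.dur.min"),
                    old = c("6o", "about 90"),
                    new = c("60", "90"),
                    stringsAsFactors = FALSE)

# Apply each recorded fix in turn
for (i in seq_len(nrow(fixes))) {
  col <- fixes$column[i]
  hit <- !is.na(usda[[col]]) & usda[[col]] == fixes$old[i]
  usda[[col]][hit] <- fixes$new[i]
}

# Collect anything that still fails to parse as a number,
# so leftover odd values are visible rather than silently kept
not_missing <- !is.na(usda$chill.dur.min)
unparsed <- is.na(suppressWarnings(as.numeric(usda$chill.dur.min)))
leftover <- usda$chill.dur.min[not_missing & unparsed]
```

If `leftover` is non-empty after the fixes are applied, those values are candidates for new rows in the lookup table.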
In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing, and both should be considered chilling)
My laptop is about to run out of battery, as I forgot to bring my charger, but I can address this when I get home later tonight!
If there are any warnings that pop up in the code please let me know and I will backtrack and figure it out.
In germination_cleaning.R, combine the cold stratification entries into the chilling columns (i.e. cold stratification and chilling are the same thing, and both should be considered chilling)
Sorry this took so long! In the "cleanmerge_all_usda_JNVER" script I added a section at the very bottom where I copied all of the cold.strat.dur.XXX (Avg, Min, and Max) column data into new columns called chill.dur.XXX.comb (for "combined"), and ran some tests to make sure that no NAs were being introduced and no data were being overwritten. I put this in new columns so that we still have the original chill.dur.XXX columns from before the merge in case we need them separated.
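The combine-and-check step described above could look roughly like the following. This is a sketch under assumptions, not the code in the script: the exact column names (lowercase `min`/`max` suffixes here) and the toy values are hypothetical, and the Avg columns are left out for brevity.

```r
# Toy data with made-up values; exact column names in the real data are assumed
usda <- data.frame(chill.dur.min = c(30, NA, NA, 60),
                   cold.strat.dur.min = c(NA, 45, NA, NA),
                   chill.dur.max = c(60, NA, NA, 90),
                   cold.strat.dur.max = c(NA, 90, NA, NA))

for (suffix in c("min", "max")) {
  chill <- usda[[paste0("chill.dur.", suffix)]]
  strat <- usda[[paste0("cold.strat.dur.", suffix)]]

  # Stop if a row has values in both columns, since one would be overwritten
  stopifnot(!any(!is.na(chill) & !is.na(strat)))

  # Keep the chilling value where present, otherwise use cold stratification
  comb <- ifelse(is.na(chill), strat, chill)

  # The combined column should be NA only where both sources were NA
  stopifnot(identical(is.na(comb), is.na(chill) & is.na(strat)))

  # Store in a new column so the original chill.dur columns stay untouched
  usda[[paste0("chill.dur.", suffix, ".comb")]] <- comb
}
```

Writing to new `.comb` columns mirrors the choice described above: the original columns remain available if the two treatments ever need to be separated again.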
@dbuona Could you take a look at this and get us down to one functional script? It should be called cleanAllUsda.R ... and please delete all the other scripts and extraneous files.
@dbuona @lizzieinvancouver @DeirdreLoughnan here is what Justin has done so far, in his own words. Workflow:
Started in the scrapeUSDAseedmanual folder
In the cleaning folder is "germination_master_spreadsheet.csv", which is the original data
Read into RStudio with the cleaning script titled "germinationCleaning.R", which does the bulk of the general cleaning: removing stray symbols, converting unreadable NAs into proper NA format, fixing species names, adding new columns for scarification and chilling, etc.
The output of this was called germinationCleaned.xlsx, which Selena then went through, manually fixing some issues based on weird values from the USDA manual pdf
This was then saved as "germinationCleaned_official.csv"
Read this file into R through the cleaning script "germinationCleaningFinal.R", where I changed column names and pivoted the germination response data wider
I then got the comment from Deirdre asking for metadata and to make some more changes so it's closer in format to EGRET
Thus I converted this to an excel file and made a new sheet for the metadata, saving these two separately as .csv in case anyone wanted them as .csv
Parsed the cleaned data into a new cleaning script called "germinationEGRETCorrections.csv" for the final round of touch ups
All cleaning scripts can be found in "scrapeUSDAseedmanual/cleaning" and all intermediate data files can be found in "scrapeUSDAseedmanual/output/earlyIterationDataSheets"
Ideally I would have done this all in 1 script, but we came across a couple of data cleaning issues along the way that Selena was already working on, so we felt it made more sense to build new scripts on top of older ones so that we wouldn't be tampering with changes done by hand in Excel.
I forgot to mention that "germinationCleaningFinal.R" gave the output "USDAGerminationCleanedFinal.csv" which was what I used to make the "usdaGerminationMaster.xlsx" file. There's another file in the earlyIterationDataSheets called "usdaGerminationJINJJA.csv" which I made as a backup because my RStudio was bugging out on me during the EGRET correction script writing.
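For reference, the general cleaning step in the workflow (removing stray symbols and converting unreadable NAs) can be sketched in base R as below. This is an illustration only, not the contents of germinationCleaning.R: the helper names `clean_text` and `to_na`, the list of NA spellings, the column names, and the toy values are all assumptions.

```r
# Toy raw data with made-up values; column names are hypothetical
raw <- data.frame(species = c("Acer rubrum*", " Abies  alba", "Betula lenta#"),
                  germ.perc = c("85", "n/a", "--"),
                  stringsAsFactors = FALSE)

# Strip stray symbols, collapse repeated spaces, and trim the ends
clean_text <- function(x) {
  x <- gsub("[^A-Za-z0-9. -]", "", x)  # keep letters, digits, dots, spaces, hyphens
  x <- gsub(" +", " ", x)
  trimws(x)
}

# Turn common NA spellings (an assumed list) into real NA values
na_strings <- c("", "NA", "n/a", "N/A", "--")
to_na <- function(x) {
  x[trimws(x) %in% na_strings] <- NA
  x
}

cleaned <- raw
cleaned$species <- clean_text(cleaned$species)
cleaned$germ.perc <- as.numeric(to_na(cleaned$germ.perc))
```

Keeping these steps as small functions would make it easier to fold them into a single script, since each can be applied to whichever columns need it.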