waldronlab / bugphyzz

Harmonized annotation of microbial physiology
http://waldronlab.io/bugphyzz/

invalid values #229

Open lwaldron opened 1 year ago

lwaldron commented 1 year ago

The following line in bugphyzzExports is identifying invalid values and dropping them. @sdgamboa please raise such curation issues here and discuss whether they should be resolved by correcting the invalid values, adding to the allowed vocabulary, or continuing to drop these values. For some, dropping certainly does seem like the right choice for ASR, but for others (like aerophilicity and shapes) I'm not so sure.

https://github.com/waldronlab/bugphyzzExports/blob/a9fc18914cb3b1d9ea3a3d1c0121ccac5c8d482a/inst/scripts/export_bugphyzz.R#L126

[1] "Invalid values for aerophilicity: "
# A tibble: 3 × 2
  Attribute_group Attribute         
  <chr>           <chr>             
1 aerophilicity   facultative aerobe
2 aerophilicity   microaerotolerant 
3 aerophilicity   positive          
[1] "Invalid values for biosafety level: "
# A tibble: 6 × 2
  Attribute_group Attribute                                           
  <chr>           <chr>                                               
1 biosafety level "biosafety level Risk group (German classification)"
2 biosafety level "biosafety level 11o58'14.4\\\""                    
3 biosafety level "biosafety level Germany"                           
4 biosafety level "biosafety level 1+"                                
5 biosafety level "biosafety level 3**"                               
6 biosafety level "biosafety level L1"                                
[1] "Invalid values for disease association: "
# A tibble: 13 × 2
   Attribute_group     Attribute                                      
   <chr>               <chr>                                          
 1 disease association caries                                         
 2 disease association periodontal disorder                           
 3 disease association Infection caused by Escherichia coli (disorder)
 4 disease association Endocarditis                                   
 5 disease association Meningitis                                     
 6 disease association Periodontal Disorder                           
 7 disease association Infection                                      
 8 disease association arthritis                                      
 9 disease association meningitis septicemia                          
10 disease association septicemia arthritis                           
11 disease association Fever                                          
12 disease association urlnary tract infection                        
13 disease association Tetnus                                         
[1] "Invalid values for growth medium: "
# A tibble: 2,191 × 2
   Attribute_group Attribute                                                                                  
   <chr>           <chr>                                                                                      
 1 growth medium   NUTRIENT AGAR (DSMZ Medium 1)                                                              
 2 growth medium   Marine agar (MA)                                                                           
 3 growth medium   R2A MEDIUM (DSMZ Medium 830)                                                               
 4 growth medium   ACETIVIBRIO MEDIUM (DSMZ Medium 122)                                                       
 5 growth medium   Zobell marine agar (ZMA)                                                                   
 6 growth medium   MEDIUM 1 - for Acetobacter, Azotobacter, Gluconobacter, Gluconacetobacter, Mesorhizodium c…
 7 growth medium   MEDIUM 85 - for Abiotrophia                                                                
 8 growth medium   GS2 agar plates                                                                            
 9 growth medium   TRYPTICASE SOY YEAST EXTRACT MEDIUM (DSMZ Medium 92)                                       
10 growth medium   MLO agar                                                                                   
# ℹ 2,181 more rows
# ℹ Use `print(n = ...)` to see more rows
[1] "Invalid values for shape: "
# A tibble: 20 × 2
   Attribute_group Attribute         
   <chr>           <chr>             
 1 shape           square            
 2 shape           vibriod cell      
 3 shape           rod-shaped        
 4 shape           coccus-shaped     
 5 shape           filament-shaped   
 6 shape           ellipsoidal       
 7 shape           pleomorphic-shaped
 8 shape           ovoid-shaped      
 9 shape           oval-shaped       
10 shape           other             
11 shape           sphere-shaped     
12 shape           spiral-shaped     
13 shape           curved-shaped     
14 shape           helical-shaped    
15 shape           vibrio-shaped     
16 shape           ring-shaped       
17 shape           spore-shaped      
18 shape           crescent-shaped   
19 shape           star-shaped       
20 shape           diplococcus-shaped
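
For anyone following along, the kind of check that produces the output above can be sketched in base R roughly as follows. This is only an illustration, not the code in export_bugphyzz.R: the `allowed` vocabulary, the `Taxon` column, and the toy rows are made-up assumptions (the real vocabulary lives in extdata/attributes.tsv), and the actual script may work differently.

```r
# Hypothetical allowed vocabulary; the real one is kept in extdata/attributes.tsv.
allowed <- data.frame(
  Attribute_group = c("aerophilicity", "aerophilicity", "shape", "shape"),
  Attribute = c("aerobic", "anaerobic", "rod", "coccus"),
  stringsAsFactors = FALSE
)

# Toy annotations (assumed layout) mixing accepted values with two of the
# values reported as invalid above.
annotations <- data.frame(
  Taxon = c("A", "B", "C", "D"),
  Attribute_group = c("aerophilicity", "aerophilicity", "shape", "shape"),
  Attribute = c("aerobic", "facultative aerobe", "rod", "rod-shaped"),
  stringsAsFactors = FALSE
)

# A row is valid only if its (group, value) pair appears in the vocabulary.
key <- function(df) paste(df$Attribute_group, df$Attribute, sep = "|")
is_valid <- key(annotations) %in% key(allowed)

# Report the distinct invalid pairs, then keep only the valid rows.
invalid <- unique(annotations[!is_valid, c("Attribute_group", "Attribute")])
valid <- annotations[is_valid, ]

for (grp in unique(invalid$Attribute_group)) {
  message(
    "Invalid values for ", grp, ": ",
    paste(invalid$Attribute[invalid$Attribute_group == grp], collapse = ", ")
  )
}
```

Keeping `invalid` as its own data frame, rather than filtering silently, is what makes it possible to decide case by case between correcting a value, adding it to the vocabulary, or continuing to drop it.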
sdgamboa commented 1 year ago

I think these come from the output of bacdiveR. @jwokaty, I've been using this spreadsheet; is there a newer version? The "biosafety level" values look like the result of incorrect parsing. I'll add the remaining values to the extdata/attributes.tsv file.

jwokaty commented 1 year ago

@sdgamboa I've created a new spreadsheet, and the biosafety level, country, and geographic location appear to be formatted correctly; however, I have not yet replaced the BacDive sheet. I wanted to give you the opportunity to look at it first: https://docs.google.com/spreadsheets/d/1P4Ic6-N9GVXcX1CdfoamFt6eozfHqt-sxfIRTBvYHWk/edit?usp=sharing. If it looks good, I'd like to upload it as a new version of the BacDive document.

sdgamboa commented 1 year ago

@jwokaty, thanks! Values for biosafety level seem fine now, and I no longer get 'X' columns when parsing the file. I added the URL to this code: https://github.com/waldronlab/bugphyzz/blob/ed8b40fe21bb2da00e10a8b9c0405d36b5036cf2/R/bacdive.R#L21-L29. Please let me know if a new URL is needed or if you overwrite the previous spreadsheet.

library(bugphyzz)
bl <- physiologies('biosafety level')[[1]]
#> Finished biosafety level.
#> Warning: Missing columns in biosafety level. Missing columns are: Genome_ID,
#> Accession_ID
unique(bl$Attribute)
#> [1] "biosafety level 1"   "biosafety level 2"   "biosafety level 3"  
#> [4] "biosafety level 1+"  "biosafety level 3**" "biosafety level L1"

Created on 2023-09-20 with reprex v2.0.2

jwokaty commented 1 year ago

@sdgamboa I'm glad that it's working better. I think that we should use the original URL, as we can make use of Google Sheets versioning. It only keeps 30 days of version history, but it will allow us to upload a new version without changing the URL in bugphyzz.
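
To illustrate why a stable URL helps, here is a minimal sketch of deriving a CSV download link from a sheet's sharing URL, so that only one string would ever need to change. It assumes the sheet is publicly readable and uses the common `/export?format=csv` endpoint; `sheet_csv_url` and the `SHEET_ID` placeholder are hypothetical, and the code in R/bacdive.R may do this differently.

```r
# Hypothetical helper: turn a Google Sheets sharing URL into a CSV export URL.
sheet_csv_url <- function(sheet_url) {
  # Capture the spreadsheet ID that follows "/spreadsheets/d/".
  id <- sub("^.*/spreadsheets/d/([^/]+).*$", "\\1", sheet_url)
  # sub() returns its input unchanged when the pattern does not match.
  if (identical(id, sheet_url)) {
    stop("Not a Google Sheets URL: ", sheet_url)
  }
  paste0("https://docs.google.com/spreadsheets/d/", id, "/export?format=csv")
}

# SHEET_ID is a placeholder, not the real BacDive spreadsheet ID.
bacdive_sheet <- "https://docs.google.com/spreadsheets/d/SHEET_ID/edit?usp=sharing"
bacdive_csv <- sheet_csv_url(bacdive_sheet)

# Reading the data needs network access and a publicly readable sheet:
# bacdive <- read.csv(bacdive_csv)
```

Because the spreadsheet ID stays the same when a new version is uploaded to the same document, nothing in this sketch would need editing after an update.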

sdgamboa commented 1 year ago

@jwokaty, agreed. I'll switch back to the original URL when the spreadsheet gets updated.

jwokaty commented 1 year ago

I've updated the Google Sheet!