ropensci / EDIutils

An API Client for the Environmental Data Initiative Repository
https://docs.ropensci.org/EDIutils/

read_data_entity_names of multiple package_ids #47

Closed paschatz closed 1 year ago

paschatz commented 1 year ago

Hey!

I need a hand here. I have a list of the latest package IDs for multiple datasets (accessed 2023-09-25). Instead of accessing them one by one, I want to process all of them at once:

# This is a list with all the package IDs I am trying to access (n=81)
package_ids <- data_list$packageid #chr

# Download all of the data entities for the package IDs in the list
for(i in 1:length(package_ids)){
  data_entity_names <- read_data_entity_names(packageId = package_ids[i])
}

which returns:

"Error in read_data_entity_names(packageId = package_ids[i]) : Not Found (HTTP 404). Failed to ."

Has anyone encountered a similar problem before? Is there a better way or command to work around this?

Thanks, Paschalis

clnsmth commented 1 year ago

Hi @paschatz, thank you for bringing this issue to our attention.

I've reviewed your report, and I'm unable to reproduce the error on my system. It seems that read_data_entity_names() is functioning as expected, returning a data frame containing data entity identifiers and names.

> # A slightly modified version of the original script
> 
> library(EDIutils)
> 
> # A list of data package IDs from which to get data entity names
> package_ids <- c("edi.1.1", "edi.3.1")
> 
> # Get data entity identifiers and names
> for (i in 1:length(package_ids)) {
+   data_entity_names <- read_data_entity_names(packageId = package_ids[i])
+   print(data_entity_names) # Print results so we can see them
+ }
                          entityId                                 entityName
1 cba4645e845957d015008e7bccf4f902                  E1 Plant Biomass 6 16.csv
2 482fef41e108b34ad816e96423711470 E1_Plant_Species_composition_6_16_long.csv
                          entityId                       entityName
1 76d277e7bcc9c97f2daa4fdfd55ef11f SBCMBON integrated benthic cover
>

Based on the comment in your script, "Download all of the data entities for the package IDs in the list," I understand that your objective may extend beyond merely retrieving data entity IDs and names. You may be interested in downloading these data entities and potentially parsing them into the R environment.

To achieve this, you'll need to iterate through each data entity ID, pass it to an appropriate parser, and store the results in variables within the R environment. For a practical example, please refer to the vignette on data access.
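As a rough illustration of that workflow, the sketch below retrieves the entity IDs for one package, downloads each entity, and parses it into R. It assumes every entity is a CSV file (real packages may contain other formats, so the parser should be adapted per entity) and uses the example package ID "edi.1.1" from above:

```r
library(EDIutils)

# A minimal sketch: download and parse all data entities of one package,
# assuming each entity is a CSV file
package_id <- "edi.1.1"
entities <- read_data_entity_names(packageId = package_id)

data_list <- lapply(entities$entityId, function(eid) {
  raw <- read_data_entity(packageId = package_id, entityId = eid)  # raw bytes
  read.csv(text = rawToChar(raw))  # parse; swap in a format-appropriate parser
})
names(data_list) <- entities$entityName
```

Each element of `data_list` is then a data frame named after its entity.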

We acknowledge that this process may not be the most straightforward way to access data entities, and we are actively working on a more streamlined solution.

Please don't hesitate to reach out if you have any further questions or if I'm misinterpreting your use case.

paschatz commented 1 year ago

Yes, my objective is to download the data, and I thought I first needed to get the entityId. I reproduced your code and it works for me too... so I did some digging and noticed that I was applying the function to a vector containing NAs. Once I removed the NAs, my code ran smoothly. But I still get the entityId only for the first entry. I will keep working on it.
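Note that the loop in the original script assigns each result to the same variable, `data_entity_names`, so every iteration overwrites the previous one and only a single package's result survives. One way to keep all results, sketched here assuming the NAs have already been dropped:

```r
library(EDIutils)

# Drop NAs, then collect one result per package ID in a named list
# instead of overwriting a single variable inside a for loop
package_ids <- na.omit(package_ids)
all_entity_names <- lapply(package_ids, function(id) {
  read_data_entity_names(packageId = id)
})
names(all_entity_names) <- package_ids
```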

As a recommendation for improvement, may I (respectfully) suggest introducing a warning message: "hey dummy, your vector has NAs, drop them and re-try". (Or something along those lines :P )
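Such a check could look something like this (a hypothetical wrapper for illustration only, not part of the EDIutils API):

```r
# Hypothetical guard illustrating the suggested NA check (not part of EDIutils)
read_data_entity_names_checked <- function(packageId) {
  if (is.na(packageId)) {
    stop("packageId is NA; drop NAs from your input vector and re-try.")
  }
  EDIutils::read_data_entity_names(packageId = packageId)
}
```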

Cheers and thank you for your response, Paschalis.

paschatz commented 1 year ago

Hey, so I worked around it and fixed the loop, but the downloaded zip files seem to be broken. I ran an example outside the loop to test whether the problem was the loop or my code:

# Read a single data entity
try_read <- read_data_entity(packageId = "knb-lter-cdr.444.8",
                             entityId =  "aa6271cfbaa0a63c092733fb8ae6c543")

# Create a zip archive of the full data package, then download it
transaction <- create_data_package_archive("knb-lter-cdr.444.8")

try_download <- read_data_package_archive("knb-lter-cdr.444.8",
                                          transaction,
                                          path = "data_cleaning/11_Cedar_Creek/")

It seems that even if I download a single file, I get the same problem. A colleague tried remotely and had the same issue. I tried downloading the data package from the website and it works perfectly. Do you have any idea whether there is a problem with the compressed files?

The error I get on my computer is: "unable to expand 'file_name.zip'. It is an unsupported format." (tried on both Mac and Windows).

Thanks, Paschalis

clnsmth commented 1 year ago

Thanks for this additional information @paschatz.

I am able to reproduce the error on my machine and will look into it now.

My session info:

R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] EDIutils_1.0.2

loaded via a namespace (and not attached):
[1] httr_1.4.6     compiler_4.3.0 R6_2.5.1       tools_4.3.0    curl_5.0.1

clnsmth commented 1 year ago

This issue is occurring due to a change in the repository API's 'Read Data Package Archive' method. The corresponding read_data_package_archive() function has been updated and is available for immediate use by installing EDIutils with:

devtools::install_github(
  repo = "rOpenSci/EDIutils", 
  ref = "refactor-read-data-package-archive"
)

The function no longer uses a transaction identifier, so the new call becomes:

try_download <- read_data_package_archive(
    packageId = "knb-lter-cdr.444.8", 
    path = "data_cleaning/11_Cedar_Creek/"
)

Next steps are to update the docs, tests, and release into the development and main branches.

@paschatz, does this fix the issue? Is there anything else? Thanks again for reporting it!

paschatz commented 1 year ago

Hey @clnsmth,

It works smoothly now!! 💯

I appreciate your support.

Best, Paschalis

clnsmth commented 1 year ago

Happy to help @paschatz!

I'm going to reopen this issue to serve as a reminder until the fix is released into the main branch.