hubverse-org / hubValidations

Testing framework for hubverse hub validations
https://hubverse-org.github.io/hubValidations/

`validate_submission()` does not accept filenames that include the compression (for example: `.gzip.parquet`) #84

Closed LucieContamin closed 2 weeks ago

LucieContamin commented 1 month ago

It seems that when arrow writes a compressed parquet file, the compression may or may not be included in the filename; both forms work. For example, `arrow::write_parquet(df, "model-output/JHU_UNC-flepiMoP/2024-04-28-JHU_UNC-flepiMoP.gzip.parquet", compression = "gzip", compression_level = 9)` and `arrow::write_parquet(df, "model-output/JHU_UNC-flepiMoP/2024-04-28-JHU_UNC-flepiMoP.parquet", compression = "gzip", compression_level = 9)` both produce a file with the same content, and both files can be read with the same arrow function call.
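A minimal sketch for checking this locally, assuming the arrow package is installed with gzip support. The toy `df` and the temporary output directory are placeholders for illustration, not the real submission data:

```r
library(arrow)

# Hypothetical toy data standing in for a model-output table
df <- data.frame(
  output_type_id = c(0.25, 0.5, 0.75),
  value = c(10, 20, 30)
)

# Write into a temporary directory so nothing in a real hub is touched
out_dir <- tempfile("model-output-")
dir.create(out_dir)
with_codec <- file.path(out_dir, "2024-04-28-JHU_UNC-flepiMoP.gzip.parquet")
without_codec <- file.path(out_dir, "2024-04-28-JHU_UNC-flepiMoP.parquet")

# Same data and same compression settings; only the filename differs
write_parquet(df, with_codec, compression = "gzip", compression_level = 9)
write_parquet(df, without_codec, compression = "gzip", compression_level = 9)

# Both files are read with the same call; compare their contents
a <- as.data.frame(read_parquet(with_codec))
b <- as.data.frame(read_parquet(without_codec))
same_content <- isTRUE(all.equal(a, b))
```

If the two files hold the same data, `same_content` is `TRUE`.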

However, when `validate_submission()` is given a filename that includes the compression information, it returns an error and does not validate the file:

✖ 2024-04-28-JHU_UNC-flepiMoP.gz.parquet: EXEC ERROR: Error in parse_file_name(file_path) : Could
  not parse file name 2024-04-28-JHU_UNC-flepiMoP.gz for submission metadata. Please consult
  documentation for file name requirements for correct metadata parsing.

annakrystalli commented 1 month ago

Discussed at retreat and decided this should be supported. The parquet compression options available are: "snappy", "gzip", "brotli", "zstd", "lz4", "lzo" and "bz2". See https://arrow.apache.org/docs/r/reference/write_parquet.html
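
One way this support could look is to strip a trailing compression extension before the file name is parsed for metadata. The sketch below uses base R only and is not the hubValidations implementation; the helper name `strip_compression_ext()` and the `compression_codecs` vector are hypothetical, built from the codec names listed above:

```r
# Codec names as listed above for arrow::write_parquet()
compression_codecs <- c("snappy", "gzip", "brotli", "zstd", "lz4", "lzo", "bz2")

# Hypothetical helper (not part of hubValidations): return the file name
# without its file extension and without a trailing compression extension.
strip_compression_ext <- function(file_path, codecs = compression_codecs) {
  # Drop the directory and the final extension (e.g. ".parquet")
  stem <- tools::file_path_sans_ext(basename(file_path))
  # Drop one trailing ".<codec>" if present
  pattern <- paste0("\\.(", paste(codecs, collapse = "|"), ")$")
  sub(pattern, "", stem)
}

with_codec <- strip_compression_ext(
  "model-output/JHU_UNC-flepiMoP/2024-04-28-JHU_UNC-flepiMoP.gzip.parquet"
)
without_codec <- strip_compression_ext(
  "model-output/JHU_UNC-flepiMoP/2024-04-28-JHU_UNC-flepiMoP.parquet"
)
# Both calls give "2024-04-28-JHU_UNC-flepiMoP"
```

Note that this sketch only recognises the codec names in the list, so a `.gz.parquet` file name like the one in the error message above would still keep its `.gz` suffix.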