leap-stc / data-management

Collection of code to manually populate the persistent cloud bucket with data
https://catalog.leap.columbia.edu/
Apache License 2.0

SSH_filtering_data #141

Status: Open · Hamsterrrrrrrrr opened 1 month ago

Hamsterrrrrrrrr commented 1 month ago

Dataset Name

ssh_train_aug.zarr, ubm_train_aug.zarr, ssh_train.zarr, ssh_val.zarr, ssh_test.zarr, bm_train.zarr, bm_val.zarr, bm_test.zarr, ubm_train.zarr, ubm_val.zarr, ubm_test.zarr

Dataset URL

https://zenodo.org/records/6574307

Description

We used the data from https://zenodo.org/records/6574307 and applied augmentation to it: we generate random patches in the spatial domain and apply several augmentations, which yields substantially more data for training.
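
For context, a minimal sketch of the kind of random-patch augmentation described above, assuming the raw field is a 2-D xarray DataArray; the store name `ssh_raw.zarr`, the variable `ssh`, the patch size, and the dimension names are illustrative, not taken from the issue:

```python
# Hypothetical sketch of patch-based augmentation; all names are illustrative.
import numpy as np
import xarray as xr

rng = np.random.default_rng(seed=0)

def random_patches(da, patch=64, n_patches=100):
    """Cut random patch x patch windows from a 2-D DataArray and apply
    simple flip augmentations, yielding many more training samples."""
    ny, nx = da.sizes["lat"], da.sizes["lon"]
    out = []
    for _ in range(n_patches):
        j = int(rng.integers(0, ny - patch))
        i = int(rng.integers(0, nx - patch))
        p = da.isel(lat=slice(j, j + patch), lon=slice(i, i + patch)).values
        if rng.random() < 0.5:   # random vertical flip
            p = np.flip(p, axis=0)
        if rng.random() < 0.5:   # random horizontal flip
            p = np.flip(p, axis=1)
        out.append(p)
    return np.stack(out)

ds = xr.open_zarr("ssh_raw.zarr")  # hypothetical raw input store
patches = random_patches(ds["ssh"])
out = xr.DataArray(patches, dims=("sample", "y", "x")).to_dataset(name="ssh")
out.to_zarr("ssh_train_aug.zarr", mode="w")  # matches the naming above
```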

Size

~100 GB, split across many files

License

Unknown

Data Format

Zarr

Data Format (other)

No response

Access protocol

HTTP(S)

Source File Organization

No response

Example URLs

No response

Authorization

No; data are fully public

Transformation / Processing

No response

Target Format

Zarr

Comments

No response

jbusecke commented 1 month ago

Hi @Hamsterrrrrrrrr, thanks for submitting a dataset request!

I think I need some further clarification on what exactly to ingest here. Following the link you sent, I see two files:

[screenshot: the Zenodo record's file listing, showing two files]

Do you want these to be converted to zarr in the cloud? Or are there more as you indicated with:

> ~100 GB, split across many files

Happy to work on this once we are clear on the details.

Hamsterrrrrrrrr commented 1 month ago

Hi @jbusecke,

The data link contains the raw data, on which I did some processing. The files I uploaded are the processed data, not the raw data; I had already converted them to .zarr before uploading. I hope this clarifies things.
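
For illustration, one of the uploaded stores could later be opened along these lines; the URL is a placeholder (no public location exists yet), and consolidated metadata is assumed so that plain HTTP(S) access works without directory listings:

```python
import xarray as xr

# Placeholder URL; the stores have not been given a public location yet.
url = "https://example.org/leap/ssh_train.zarr"
ds = xr.open_zarr(url, consolidated=True)  # assumes a .zmetadata file exists
print(ds)
```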

Best regards, Yue

jbusecke commented 1 week ago

Hi @Hamsterrrrrrrrr,

> The data link contains the raw data

The raw data is at https://zenodo.org/records/6574307?

> , on which I did some processing.

What sort of processing did you do, and where is the output located?

We currently use the dataset ingestion workflow to get officially archived/published datasets into an analysis-ready, cloud-optimized format (e.g. Zarr). Since this data was processed by you, we need to figure out a way to host/ingest it; see the note box here. Maybe we should schedule a call to discuss the details?
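
As a rough illustration of the hosting question, a locally produced Zarr store could be staged into a cloud bucket along these lines; this is a sketch using gcsfs with a placeholder bucket name, not the actual LEAP ingestion workflow:

```python
import gcsfs

# Assumes Google Cloud credentials are already configured in the environment.
fs = gcsfs.GCSFileSystem()
fs.put("ssh_train_aug.zarr",                      # local store (placeholder)
       "gs://leap-example-bucket/ssh_train_aug.zarr",
       recursive=True)                            # copy the whole directory tree
```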