kartoza / WRODataPlatform

WRC Water Research Observatory Data Platform
0 stars 3 forks source link

xls/xlsx improvements #61

Open vermeulendivan opened 1 year ago

vermeulendivan commented 1 year ago

Improvements to the excel formats.

Check how to do multiple sheets.

vermeulendivan commented 1 year ago

@mikev3003 This is not yet on the site, but want to provide some feedback.

And an important factor I want to point out, is when there are multiple tables stored in one sheet you send as an example, which gets very problematic. Such as that example you send me (3 tables in one sheet, only 2 visible in the screenshot): image It will just be impossible to accurately determine what "is a table", "where one table starts and another ends", etc. Basically we need to follow similar rules as with CSV to limit possible errors. Limitations won't be as hardcore as CSV, but having rules when it needs to be loaded into BQ will be a good idea. If the Excel file is not required to be loaded into BQ then its fine and can be stored in the storage with no issue. Will discuss this in more detail at our next meeting.

vermeulendivan commented 1 year ago

@ThiashaV @Jeremy-Prior Excel format can also now be tested.

@mikev3003 This has been deployed on GCP, so you can also give it a test if you want to.

Take into account what I explained in the previous comments. Such cases will be ingnored/skipped and the next sheet/file will be processing. Both xls and xlsx works. Multi-sheet support has been added, and the sheet name is included in the table name if the excel contains more than one sheet. The sheet name won't be used if there is only one sheet in the excel file.

mikev3003 commented 1 year ago

@vermeulendivan I've uploaded an xlsx test file, see it it GCP, but not in BQ 7OctCapture4 7OctCapture3

vermeulendivan commented 1 year ago

Hey @mikev3003 . Had a look at the Excel sheet you uploaded, and this is another good example of what cannot be loaded into BigQuery, because its multiple tables in a single (see https://github.com/kartoza/WRODataPlatform/issues/61#issuecomment-1269627490, just the comment above). Here is how the current Excel sheet looks: image And then there is another random table at the bottom: image Its impossible to "know" where a table starts, or ends, or what is a seperate table, and if there are random empty rows and columns it makes it even more difficult as there is no way to know if its null values or not. BigQuery tables cannot store such cases. There is nothing I can do about it, except skip such cases. Other sheets is still processed.

Please have a look at the documentation: https://github.com/kartoza/WRODataPlatform/blob/development/documents/SOP_upload_guide_v01.odt. It explains everything into detail. I've updated it yesterday. For this case, have a look at the "Multiple table in an Excel sheet" section. It explains into detail how to workaround such cases. You can even move each table in that sheet to its own sheet in the same file, so long a sheet has only one table.

mikev3003 commented 1 year ago

@vermeulendivan noted, thanks very much for the feedback

vermeulendivan commented 1 year ago

@mikev3003 Also, the sheet name determines the final table name in BigQuery. So a file named "Just_a_table.xlsx" with a sheets named "table1" and "anotherTable" would result in "Just_a_table_table1" and "Just_a_table_anotherTable" in BigQuery. If an excel file only has one sheet, the sheet name is not needed.

mikev3003 commented 1 year ago

@vermeulendivan impressive stuff! Any ideas whether we need to think about organising the BQ tables into 'folders' or categories, or is ending up with 1000s of tables in a BQ table list just the nature of the beast?

What happens when a table is uploaded with the same name as an existing table? (Apologies if I missed this in the documentation)

vermeulendivan commented 1 year ago

@mikev3003 I actually thought about it yesterday while working on the SOP (so nothing in the SOP on it yet). There are many ways we can go with it, but were planning on mentioning it in the meeting next week:

I like the first option the most, as it will have the added benefit of keeping the tables "more organized". What do you think?

mikev3003 commented 1 year ago

@vermeulendivan I like the first option too!

mikev3003 commented 1 year ago

@vermeulendivan checked out the sw_cape dataset (thanks @MikeSilberbauer!), pulled into BQ very nicely @Mohab25 wasn't able to download excel file though - can you pls check it out? 14OctCapture1

Mohab25 commented 1 year ago

@mikev3003 it's now correctly referenced, just want to ask how it's uploaded (through CKAN or GCP) ?

mikev3003 commented 1 year ago

Thanks very much @Mohab25, @MikeSilberbauer would've used CKAN to upload it