Provide functionality to merge multiple metadata workbooks into single for qiita upload

AmandaBirmingham commented 6 years ago

Qiita only accepts a single prep template, so if the user generates multiple metadata workbooks for their experiment because they have multiple sample types, then those need to be merged into one big table for Qiita upload.

User should be able to go to a page for this purpose, select multiple files to upload, have them merged on the server, and then returned to them.

AmandaBirmingham commented 6 years ago

@adswafford to write the code to read the uploaded files, pull the metadata grid from the metadata spreadsheet in each, merge them using pandas data frames, and create desired output. Austin, suggest that you use the openpyxl library and build off the functionality at https://github.com/ucsd-ccbb/cmi_metadata_wizard/blob/a263a38f134da693246a42386420cfb80c61a879/metadata_package_schema_builder.py#L117-L126 which shows how to use this library to open an xlsx file, find a sheet, and read the contents of one cell.

AmandaBirmingham commented 6 years ago

Amanda to add additional URL and tornado handler, with template including upload widget, that implements merge code provided by Austin.

AmandaBirmingham commented 6 years ago

@adswafford : I have implemented the additional URL and tornado handler, with template including upload widget, as well as the necessary session handling. As far as I know, the ball for this task is in your court for now.

adswafford commented 6 years ago

Goal is Jan 5 for having the code for the merge.

To make sure we hook it up correctly, 'file_info_dicts_list' is a list with the files in .xlsx format to parse and merge right? I need to have a look at the code to make sure we can handle and display errors correctly, what's the elegant way to capture this and let users troubleshoot? Use case: user tries to merge one spreadsheet with a 'Fix' cell (fails validation), so the script errors out and provides a message, how easy will it be to display this and let the user retry after leading the corrected file?

And is there a URL for the dummy webpage that doesn't merge the templates?

AmandaBirmingham commented 6 years ago

@adswafford : The url for the merge page is the main url with "/merge" after it :) This page implements the file upload; note that you can drag and drop files you want to upload anywhere on the page, instead of/in addition to choosing them using the "add files" button. However, of course it does not currently implement any no actual merge functionality; after the upload, it redirects you to the download page with a dynamically-generated download filename of the sort I expect you might get after a real merge (which, if you try to download it, will give you 404 error since no such actual merge file is being created.)

The code you are writing needs to be hooked into the metadata_wizard_server.MergeHandler.post function, at these lines: https://github.com/ucsd-ccbb/cmi_metadata_wizard/blob/980686bde0d9008df8c750153f7f9f0fb46f750d/metadata_wizard/metadata_wizard_server.py#L146-L158

file_info_dicts_list is technically a list of dictionaries where each item in the list is a dictionary holding the uploaded file information produced by the file upload process. Each of these dictionaries contains three main keys, with their associated values: "filename" (e.g. "mytemplate.xlsx"), "content_type" (e.g., "'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'"), and the really important one, "body", which holds the actual contents of the file (NB: not a PATH to a file). How you choose to use these pieces of info is of course up to you, but I have yet to find a good way to read xlsx content from a stream with the available python libraries, and have thus found it necessary to dump each file's body content at least temporarily to a file as shown at https://github.com/ucsd-ccbb/cmi_metadata_wizard/blob/980686bde0d9008df8c750153f7f9f0fb46f750d/metadata_wizard/metadata_wizard_server.py#L89-L103 .

Error display CAN be very simple (or complicated, of course, depending on what you have in mind). A very easy approach would be to generate a list of error messages in your merge function and return them (return null if there are no errors) ... if the code in the MergeHandler gets a non-null error messages list, it can redirect to a page that displays the errors and just gives the user a link on that page to go back to the merge page and re-attempt the uploads, whereas if it gets no error messages it can direct (as it does now) to the download page.

AmandaBirmingham commented 6 years ago

@adswafford : I have merged your pull request; you can now find it in the main branch, but more work is required. After making necessary small integration changes (importing the libraries used, etc), I ran the pulled code on some spreadsheets that I created with the Metadata Wizard and then filled manually. I see several items that I believe need to be addressed:

1) At root, the approach in the pulled code does not seem to work: the issue is that the code is looking at cell.value, which sees the formulas rather than what they evaluate to. For example, this means that the "if str(cell.value) == 'Fix':" statement is never triggered (because the value "Fix" is dynamically decided on via a formula, and is not the actual value of the cell) and thus sheets with invalid values are not excluded. Default values are also not correctly represented in the output, because those are also populated via formulas. 2) The code loop over all rows in the metadata sheet, NOT just those rows that contain actual user input. Since, as you recall, the requirement was to support users entering on the order of a thousand rows, this means that there are formulas in (currently) the first thousand rows regardless of whether the user actually entered any info into them--and currently these rows' formulas are being placed in the output. 3) The code is raising Python errors, but is not catching them to generate a list of error messages for return to a front-end dedicated error page as suggested in the earlier discussion of this issue. This means that if any error is encountered, the user gets a basic "500 Internal Server Error" message with no additional information or redirection.

I am attaching the example output that I got running the code, which may make items 1 and 2 above clearer (note it was really produced as a .tsv, but GitHub won't let me upload a file with that extension, so I had to change it to .txt :-/ )

All of the issues above CAN be addressed, but not without additional development time from you or me or both. Please let me know how you would like to proceed.

ebi-submittable-rev_test.txt

adswafford commented 6 years ago

Ha, okay that's what I get for coding in a hurry and not testing with live files...

I'll mod the code to correct these issues and submit a PR early next week; do you have a target date for the next server update? Maybe after the packages are prepared?

On Fri, Jan 5, 2018 at 1:22 PM, Amanda Birmingham notifications@github.com wrote:

@adswafford https://github.com/adswafford : I have merged your pull request; you can now find it in the main branch, but more work is required. After making necessary small integration changes (importing the libraries used, etc), I ran the pulled code on some spreadsheets that I created with the Metadata Wizard and then filled manually. I see several items that I believe need to be addressed:

At root, the approach in the pulled code does not seem to work: the issue is that the code is looking at cell.value, which sees the formulas rather than what they evaluate to. For example, this means that the "if str(cell.value) == 'Fix':" statement is never triggered (because the value "Fix" is dynamically decided on via a formula, and is not the actual value of the cell) and thus sheets with invalid values are not excluded. Default values are also not correctly represented in the output, because those are also populated via formulas.

The code loop over all rows in the metadata sheet, NOT just those rows that contain actual user input. Since, as you recall, the requirement was to support users entering on the order of a thousand rows, this means that there are formulas in (currently) the first thousand rows regardless of whether the user actually entered any info into them--and currently these rows' formulas are being placed in the output.

The code is raising Python errors, but is not catching them to generate a list of error messages for return to a front-end dedicated error page as suggested in the earlier discussion of this issue. This means that if any error is encountered, the user gets a basic "500 Internal Server Error" message with no additional information or redirection.

I am attaching the example output that I got running the code, which may make items 1 and 2 above clearer (note it was really produced as a .tsv, but GitHub won't let me upload a file with that extension, so I had to change it to .txt :-/ )

All of the issues above CAN be addressed, but not without additional development time from you or me or both. Please let me know how you would like to proceed.

ebi-submittable-rev_test.txt https://github.com/ucsd-ccbb/cmi_metadata_wizard/files/1607916/ebi-submittable-rev_test.txt

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ucsd-ccbb/cmi_metadata_wizard/issues/89#issuecomment-355668820, or mute the thread https://github.com/notifications/unsubscribe-auth/AZxBhnL7SIXFf0BBh5-m37IpGx_BFWreks5tHpKggaJpZM4RCr81 .

AmandaBirmingham commented 6 years ago

Per Austin's email 2018-01-08: "Likewise, doing the merge manually by saving TSVs and then having them upload to Qiita and update 1 time per sample group is not a terrible task so delaying this for now is reasonable."

adswafford commented 5 years ago

Now that Qiita can accept Qiimp templates and Qiimp lives in Qiita, I think we'll let Qiita sort this out.

ucsd-ccbb / qiimp

Provide functionality to merge multiple metadata workbooks into single for qiita upload #89