internetarchive / openlibrary

One webpage for every book ever published!
https://openlibrary.org
GNU Affero General Public License v3.0

Add a text area on the batch import page to allow raw JSONL #9487

Open scottbarnes opened 2 months ago

scottbarnes commented 2 months ago

Problem

Currently we have an endpoint, https://openlibrary.org/admin/imports/add, which takes a list of ocaid archive.org identifiers. We want patrons to use the https://openlibrary.org/import/batch/new endpoint, which we should rename to imports to be consistent with /admin/imports/add and /imports.

A clear and concise description of what you want to happen

One should be able to import a new item by entering raw JSON into a text area in the batch import endpoint at /import/batch/new (https://openlibrary.org/import/batch/new).

Once the JSONL is submitted, the same validation that happens with an uploaded JSONL file should be run.

Additional Context

See #8122, which added the existing endpoint. This issue is to extend that by, e.g., adding a <textarea> where the JSONL can be entered instead of attaching it as a file.

It was probably a mistake to have batch_import take bytes here, as this tightly couples the implementation to a file upload: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/core/batch_imports.py#L73

Instead, this should likely take a list or perhaps a generator. In any event, by changing the function signature here it should be possible to have the form used for submitting raw JSONL input plug directly into this function, unless the form data comes in as bytes, which I think it will not by default. The batch_imports endpoint will need to be updated as well to account for this change away from bytes: https://github.com/internetarchive/openlibrary/blob/e7f11e7c41b1a9317814c0e96cc1c9bf905c8b67/openlibrary/plugins/openlibrary/code.py#L502-L504.
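To illustrate the idea of moving away from bytes, here is a minimal sketch, not the actual batch_import code. It assumes a hypothetical helper, iter_jsonl_records, that accepts any iterable of text lines, plus two hypothetical adapters (lines_from_upload and lines_from_textarea) that turn an uploaded file's bytes or a textarea's string into such lines. The checks shown (valid JSON, top-level object) are placeholders for whatever validation the existing endpoint already performs.

```python
import json
from collections.abc import Iterable, Iterator
from typing import Optional


def iter_jsonl_records(
    lines: Iterable[str],
) -> Iterator[tuple[int, Optional[dict], Optional[str]]]:
    """Yield (line_number, record, error) for each non-blank line.

    Hypothetical helper: it only checks that each line is a JSON object.
    """
    for line_number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # skip blank lines, but keep the original numbering
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            yield line_number, None, f"invalid JSON: {exc.msg}"
            continue
        if not isinstance(record, dict):
            yield line_number, None, "expected a JSON object"
            continue
        yield line_number, record, None


def lines_from_upload(raw: bytes) -> list[str]:
    """Adapter for a file upload (assumes UTF-8 encoded bytes)."""
    return raw.decode("utf-8").splitlines()


def lines_from_textarea(text: str) -> list[str]:
    """Adapter for text submitted through a <textarea>."""
    return text.splitlines()
```

With a signature like this, both the file upload and the textarea could feed the same function, and only the adapter would differ between the two input paths.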

Proposal & Constraints

No response

Stakeholders


@mekarpeles


Devansh-Kushwaha commented 2 months ago

I would like to work on this issue and try to fix it. I have experience working with various Python and web development libraries, including some JSON manipulation and traversal. Please assign me this issue.

mekarpeles commented 1 month ago

@Devansh-Kushwaha did you have any questions about how to proceed? Since it's been two weeks, if you're not working on this issue we'd like to give someone else a chance :)

Devansh-Kushwaha commented 1 month ago

Yes, I apologize for the delay. I am having problems setting up the project. I am kind of new to Docker.

scottbarnes commented 1 month ago

@Devansh-Kushwaha, if you share any questions you have perhaps we can help.

slimkevo commented 1 month ago

Can I please be assigned this issue?

mekarpeles commented 1 month ago

Let's also have a way to validate the input (e.g. pass in ?validate=true) and, if the flag is present, raise any validation errors and skip the import.
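A rough sketch of how such a flag might behave, under stated assumptions: check_batch and its validate_only parameter are hypothetical names (the latter standing in for ?validate=true), and the only checks shown are that each line is a JSON object. This sketch also assumes that nothing is imported when any line fails, which may not match what the existing endpoint does.

```python
import json
from collections.abc import Iterable


def check_batch(
    lines: Iterable[str], validate_only: bool = False
) -> tuple[list[str], list[dict]]:
    """Return (errors, records_to_import) for a batch of JSONL lines.

    Hypothetical helper: with validate_only set, errors are reported
    but no records are handed on for import.
    """
    errors: list[str] = []
    records: list[dict] = []
    for line_number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {line_number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {line_number}: expected a JSON object")
            continue
        records.append(record)

    # Assumption: a validate-only request, or any error, imports nothing.
    if validate_only or errors:
        return errors, []
    return errors, records
```

A caller would show the returned errors to the patron either way; only when the flag is absent and the error list is empty would the records go on to the real import step.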

mekarpeles commented 1 month ago

Give it a try @slimkevo! Can you reply with your approach and any blockers you're hitting as you get set up?

slimkevo commented 3 days ago

@mekarpeles I am experiencing issues with uploading raw JSON data through the import functionality. When I attempt to upload JSON data, the import fails with validation errors or SQL exceptions. Specifically, I receive errors indicating issues with the JSON format or missing fields in the database. For example, SQL errors like 'column "submitter" of relation "import_item" does not exist' indicate discrepancies between the JSON data and the database schema.

scottbarnes commented 3 days ago

It sounds as if we may need to add that column to the local development environment SQL schema.

In the interim, something like this should at least resolve the error about the submitter column, @slimkevo:

❯ docker compose exec db bash
WARN[0000] The "HOST" variable is not set. Defaulting to a blank string. 
root@29b2d94b9e8d:/# psql -U openlibrary
psql (9.3.25)
Type "help" for help.

openlibrary=# ALTER TABLE public.import_item ADD COLUMN submitter text;
ALTER TABLE
openlibrary=# \d import_item
                                      Table "public.import_item"
   Column    |            Type             |                        Modifiers                         
-------------+-----------------------------+----------------------------------------------------------
 id          | integer                     | not null default nextval('import_item_id_seq'::regclass)
 batch_id    | integer                     | 
 added_time  | timestamp without time zone | default timezone('utc'::text, now())
 import_time | timestamp without time zone | 
 status      | text                        | default 'pending'::text
 error       | text                        | 
 ia_id       | text                        | 
 data        | text                        | 
 ol_key      | text                        | 
 comments    | text                        | 
 submitter   | text                        | 
... more output omitted ...

Please let us know how that goes, and whether more must be done to use the endpoint.

This file contains a pair of uncommented records that should not have validation errors: two_item_import.txt