gitonthescene / csv-reconcile

A reconciliation service for OpenRefine serving data from a given CSV file.
MIT License

ValueError: 'item' is not in list #65

Closed woody544 closed 2 years ago

woody544 commented 2 years ago

I was able to set up and run csv-reconcile serve, but I cannot run the example on the reps.tsv file: I get ValueError: 'item' is not in list. Similarly, when I try the progressives.tsv file, I get ValueError: 'itemLabel' is not in list. The errors are otherwise identical except for the last few lines. I have tried restarting everything, but I cannot get the init step to work before running the serve command. Any suggestions would be appreciated.

Last few lines of the error for reps.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

Last few lines of the error for progressives.tsv:

  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 67, in init_db
    searchidx = header.index(searchcol)
ValueError: 'itemLabel' is not in list

The full error for reps.tsv:

(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile init sample/reps.tsv item itemLabel
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 321, in main
    return cli()
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 271, in init
    return doinit(config, scorerOption, csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 259, in doinit
    initdb.init_db_with_context(csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 95, in init_db_with_context
    return init_db(db,
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

gitonthescene commented 2 years ago

Would you mind posting exactly what commands you ran at the prompt to produce the errors? Ideally both the init and serve commands. I’m assuming these are just the .tsv files from the samples directory.

FWIW, I don’t have ready access to a Windows machine but hopefully we can work this out together.

gitonthescene commented 2 years ago

There might be a clue here. Looking forward to hearing back on what you typed.

woody544 commented 2 years ago

> Would you mind posting exactly what commands you ran at the prompt to produce the errors? Ideally both the init and serve commands. I’m assuming these are just the .tsv files from the samples directory.

Yes, I am using the .tsv files from the samples directory.

In following the steps outlined, I have the error after the first init step:

(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile init sample/reps.tsv item itemLabel
Traceback (most recent call last):
  File "C:\Program Files\Python39\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Program Files\Python39\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile\venv\Scripts\csv-reconcile.exe\__main__.py", line 7, in <module>
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 321, in main
    return cli()
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 271, in init
    return doinit(config, scorerOption, csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\__init__.py", line 259, in doinit
    initdb.init_db_with_context(csvfile, idcol, namecol)
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 95, in init_db_with_context
    return init_db(db,
  File "c:\users\jennifer.woodward\onedrive - usda\mygitonedrive\nalt4ma\csv-reconcile\venv\lib\site-packages\csv_reconcile\initdb.py", line 68, in init_db
    ididx = header.index(idcol)
ValueError: 'item' is not in list

(venv) C:\Users\jennifer.woodward\OneDrive - USDA\myGitOneDrive\NALT4MA\csv-reconcile>csv-reconcile serve
 * Serving Flask app 'csv-reconcile' (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://127.0.0.1:5000 (Press CTRL+C to quit)
127.0.0.1 - - [22/Apr/2022 23:54:56] "GET /reconcile HTTP/1.1" 200 -

The browser appears as:

[image: screenshot of the browser, not preserved]

gitonthescene commented 2 years ago

Okay, the serve step can’t work until you get the init step to pass so let’s focus on that.

I saw your comment on the other issue. Did you actually try using the cp1250 encoding described there? It might be worth doing that after deleting everything and starting from scratch.

The only thing that’s clear is that the csv file is actually being read but it’s not seeing the columns as columns. It’s not a bad guess that the issue might be handling the encoding.

If this doesn’t fix it I may need to ask you to try from a custom branch where I add more debugging info.
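To illustrate how an encoding mismatch alone can hide a column name, here is a minimal sketch using only Python's standard csv module. It assumes a file saved with a UTF-8 byte order mark (BOM); the sample text is made up and is not the contents of reps.tsv, and parse_header is a hypothetical helper, not a csv-reconcile function. Whether this is what is happening in this issue is not established.

```python
import csv
import io

# Made-up stand-in for a TSV file saved with a UTF-8 byte order mark (BOM).
raw_bytes = "item\titemLabel\nQ1\tExample\n".encode("utf-8-sig")


def parse_header(data, encoding):
    """Decode the bytes and return the first row as a list of column names."""
    text = data.decode(encoding)
    return next(csv.reader(io.StringIO(text), delimiter="\t"))


# Decoding as plain UTF-8 keeps the BOM glued to the first column name.
header_plain = parse_header(raw_bytes, "utf-8")
# Decoding as utf-8-sig strips the BOM, so the names come out clean.
header_sig = parse_header(raw_bytes, "utf-8-sig")

print(repr(header_plain[0]))
print("item" in header_plain, "item" in header_sig)
```

In the first case the columns are split correctly, yet a lookup such as header.index("item") would still fail because the first name carries an invisible prefix.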

woody544 commented 2 years ago

> Did you actually try using the cp1250 encoding described there? It might be worth doing that after deleting everything and starting from scratch.

Yes, I did not make any change to the progressives.tsv file, and it returns essentially the same error. I am no longer getting an encoding error.

woody544 commented 2 years ago

I am having trouble tracing where idcol is first identified.

gitonthescene commented 2 years ago

Just to be super clear, in the other issue you mentioned changing the encoding of the file and not using the configuration suggested there. Are you saying you deleted everything and then followed the instructions from that issue?

Also, you don’t need progressives.tsv to run the init command, and we’re currently focused on the init (i.e. first) command, namely the following from your previous post:

csv-reconcile init sample/reps.tsv item itemLabel

The args here are the csv file being reconciled against (i.e. sample/reps.tsv), the id column (item) and the “name” column (itemLabel) used to do the actual reconciliation. Both column names are expected to be found on the first line of the csv file.

If the encoding is wrong or you’re using the wrong separator then the parser might not recognize that these are separate columns. The default separator should work.
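The separator point can be reproduced with Python's standard csv module alone. This is only a sketch under assumptions: the sample text is made up, header_for and lookup are hypothetical helpers, and nothing here reflects how csv-reconcile itself chooses its separator.

```python
import csv
import io

# Made-up tab-separated text standing in for the first lines of a .tsv file.
SAMPLE = "item\titemLabel\nQ1\tExample\n"


def header_for(text, delimiter):
    """Return the first row of `text` parsed with the given delimiter."""
    return next(csv.reader(io.StringIO(text), delimiter=delimiter))


def lookup(header, column):
    """Return the column's index, or None if list.index raises ValueError."""
    try:
        return header.index(column)
    except ValueError:
        return None


comma_header = header_for(SAMPLE, ",")   # wrong separator for this text
tab_header = header_for(SAMPLE, "\t")    # matching separator

print(comma_header, lookup(comma_header, "item"))
print(tab_header, lookup(tab_header, "item"))
```

With the mismatched separator the whole first line comes back as a single column name, so looking up item fails in the same way as the header.index(idcol) call in the traceback.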

If you’re asking how these args get passed through to the code that’s failing I can walk you through it but the stack trace should tell you the files and lines you should look at.

gitonthescene commented 2 years ago

FWIW, I plan to issue a release soon with a change mentioned in the other issue which should more seamlessly handle various csv file encodings.

I’ll leave this issue open another week but if I don’t hear back, I’ll assume your issue has been resolved.

[EDIT] The release is now complete. You might want to start from scratch to see if this simply takes care of your issue.

gitonthescene commented 2 years ago

@woody544 Just checking in to see whether everything’s working for you. If I don’t hear back, I’ll close this out next week.

yochannah commented 1 year ago

I had the same issue today, and eventually managed to figure out that for me the problem was that csv-reconcile thought all my column names were in fact one big column, e.g. ["Column1\tColumn2\tColumn3\tColumn4"]. Once I knew the issue it was easy enough for me to set up the config to tell it to split on \t.
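For anyone wanting to check which separator a file uses before touching the config, one option is the Sniffer class from Python's standard csv module. This is a sketch only: the sample text is made up, Sniffer is a heuristic that can guess wrong on unusual files, and it is not something csv-reconcile is known to do.

```python
import csv
import io

# Made-up sample; in practice this would be the first few KB of the real file.
sample_text = (
    "Column1\tColumn2\tColumn3\tColumn4\n"
    "a\tb\tc\td\n"
    "e\tf\tg\th\n"
)

# Ask the sniffer to choose among a few candidate separators.
dialect = csv.Sniffer().sniff(sample_text, delimiters="\t,;")

# Parse the header with the detected dialect to see the columns as separate names.
header = next(csv.reader(io.StringIO(sample_text), dialect))

print(repr(dialect.delimiter))
print(header)
```

If the detected separator differs from what the tool is configured to use, that mismatch is a likely reason for the header showing up as one big column.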

I figured this out by adding a print statement in initdb.py just before line 87 that listed the columns I had available to choose from. This might help others in the future if they're struggling to figure it out. Would it be useful if I made a PR that did this? I don't know enough Python to know if it's an appropriately Pythonic way to behave ;)

gitonthescene commented 1 year ago

Instead of a print statement, maybe it would be better to have clearer error messages. Would you be able to give me exact steps to reproduce, including the csv to use?
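A clearer error message of the kind discussed here might look like the following sketch. The function name find_column and the wording of the message are hypothetical and are not part of csv-reconcile; the idea is simply to wrap the header.index lookup and report what was actually parsed.

```python
def find_column(header, column):
    """Return the index of `column`, or raise a ValueError listing what was parsed."""
    try:
        return header.index(column)
    except ValueError:
        hint = ""
        # A single "column" that still contains a common separator suggests
        # the file was split on the wrong character.
        if len(header) == 1 and any(sep in header[0] for sep in "\t,;|"):
            hint = (
                " The header parsed as a single column;"
                " the separator setting may not match the file."
            )
        raise ValueError(
            f"Column {column!r} not found."
            f" Columns read from the first line: {header!r}.{hint}"
        ) from None


# Usage: a header that was never split on tabs produces the extra hint.
try:
    find_column(["Column1\tColumn2\tColumn3\tColumn4"], "Column1")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Raising ValueError keeps the exception type the same as the bare list.index call, so any caller that already catches it would keep working.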