logv / snorkel

UI for interactive data analysis | https://snorkel.logv.org
https://fb.com/groups/snorkelsnorkelsnorkel

Snorkel prepends "string_", "integer_", or "set_" to column names of imported data #53

Open graysonchao opened 3 years ago

graysonchao commented 3 years ago

Hello! I stood up a local Snorkel instance and submitted some data through the HTTP API. The data shows up, but every column name has its type prepended, which it shouldn't. I'm running stock Snorkel installed from pip in a Python 2.7 environment.

Here's the raw data I'm submitting:

$ curl -XPOST -H "Content-Type: application/json" --data '{"dataset": "testdata1", "subset": "names_and_points", "samples": "[{\"integer\": {\"points\": 396, \"time\": 1610408874}, \"string\": {\"name\": \"ywtfsompmj\"}}, {\"integer\": {\"points\": 612, \"time\": 1610408874}, \"string\": {\"name\": \"nhojwjzudp\"}}, {\"integer\": {\"points\": 44, \"time\": 1610408874}, \"string\": {\"name\": \"hgfepsymqz\"}}, {\"integer\": {\"points\": 714, \"time\": 1610408874}, \"string\": {\"name\": \"aqskbeuvau\"}}, {\"integer\": {\"points\": 840, \"time\": 1610408874}, \"string\": {\"name\": \"klsrwaumek\"}}, {\"integer\": {\"points\": 576, \"time\": 1610408874}, \"string\": {\"name\": \"pyaqlbozwa\"}}, {\"integer\": {\"points\": 321, \"time\": 1610408874}, \"string\": {\"name\": \"hzykxdpqyw\"}}, {\"integer\": {\"points\": 414, \"time\": 1610408874}, \"string\": {\"name\": \"bsovbngmwc\"}}, {\"integer\": {\"points\": 451, \"time\": 1610408874}, \"string\": {\"name\": \"vnnrcuxjsv\"}}, {\"integer\": {\"points\": 385, \"time\": 1610408874}, \"string\": {\"name\": \"aghjoqhiin\"}}]"}' localhost:2333/data/import
{"success": true, "num_samples": 10}

Then, when I go to examine the data I imported, it looks like this:

[Screenshot: imported data with type-prefixed column names (Screen Shot 2021-01-11 at 3 53 43 PM)]

I suspect that I've somehow ended up on the codepath for nested dicts. I created this particular JSON blob with a Python script, following along with the "Importing Data" docs page.
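To make the suspicion concrete, here is a minimal Python sketch of how a nested-dict codepath could end up producing these names. This is not Snorkel's actual code: `flatten_typed_record` is a hypothetical helper, and the only thing taken from the behavior I'm seeing is the `<type>_<field>` naming pattern.

```python
def flatten_typed_record(sample):
    """Hypothetical: turn {"type": {"field": value}} into {"type_field": value}."""
    flat = {}
    for type_name, fields in sample.items():
        for field, value in fields.items():
            # The type key becomes a prefix on the column name.
            flat["%s_%s" % (type_name, field)] = value
    return flat


# First sample from the request above.
sample = {
    "integer": {"points": 396, "time": 1610408874},
    "string": {"name": "ywtfsompmj"},
}

print(sorted(flatten_typed_record(sample)))
# → ['integer_points', 'integer_time', 'string_name']
```

If something along these lines runs on import, it would explain why `points` shows up as `integer_points` and `name` as `string_name`.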

raisjn commented 3 years ago

yes, it is the nested record that you are running into as the prefix.

the data no longer needs to be in the nested format like that, you can supply mixed records.
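a rough sketch of converting the nested samples into flat ones, assuming "mixed records" means plain dicts keyed by field name, with the type inferred from each value. `to_mixed_record` is a made-up helper name, not part of snorkel. the payload keys (`dataset`, `subset`, `samples`) and the JSON-encoded `samples` string are copied from the curl command in the report; whether `samples` still has to be string-encoded in the new format is also an assumption.

```python
import json


def to_mixed_record(nested):
    """Merge the per-type dicts of one nested sample into a single flat dict.

    Hypothetical helper: if two types share a field name, the later one wins.
    """
    flat = {}
    for fields in nested.values():
        flat.update(fields)
    return flat


# Two of the samples from the original request, still in nested form.
nested_samples = [
    {"integer": {"points": 396, "time": 1610408874}, "string": {"name": "ywtfsompmj"}},
    {"integer": {"points": 612, "time": 1610408874}, "string": {"name": "nhojwjzudp"}},
]

samples = [to_mixed_record(s) for s in nested_samples]

# Same envelope as the curl request: "samples" is itself a JSON string.
payload = json.dumps({
    "dataset": "testdata1",
    "subset": "names_and_points",
    "samples": json.dumps(samples),
})

print(payload)
```

the printed payload can be passed to the same `curl --data` call against `localhost:2333/data/import`.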

i will update the docs