h2oai / datatable

A Python package for manipulating 2-dimensional tabular data structures
https://datatable.readthedocs.io
Mozilla Public License 2.0
1.81k stars, 155 forks

Add support for python `in` operator in filtering functions #699

Open johnygomez opened 6 years ago

johnygomez commented 6 years ago

I'd like to filter rows according to functions like

lambda x: x[0] in my_list

which use Pythonic syntax (syntactic sugar). Currently I need to rewrite this as a primitive formula that tests each element of the list separately.
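A minimal sketch of the workaround described above: expanding a membership test into a chain of equality comparisons joined with `|`. The helper name `any_equal` is hypothetical. It is demonstrated on plain scalars so that it runs without datatable installed; the assumption is that a column expression such as `f.A` could be passed as `left` instead, since the helper only relies on `==` and `|`.

```python
from functools import reduce
import operator


def any_equal(left, values):
    """Build (left == v1) | (left == v2) | ... for every v in values."""
    comparisons = [left == v for v in values]
    if not comparisons:
        # Nothing to compare against, so nothing can match.
        return False
    return reduce(operator.or_, comparisons)


# Plain scalars stand in for a column expression here.
print(any_equal(3, [1, 2, 3]))
print(any_equal(7, [1, 2, 3]))
```

The cost of this expansion grows with the length of the list, which is part of why a dedicated membership operation is attractive.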

st-pasha commented 4 years ago

Update https://stackoverflow.com/questions/61494957 when this is implemented

ghost commented 3 years ago

Any updates when this might be implemented?

samukweku commented 3 years ago

I guess the core maintainers are currently focused on building up the time series functionality in datatable; however, since it is open source, contributions are very much welcome.

ghost commented 3 years ago

I doubt I have the skills and deep level understanding to contribute such a feature. The fact that this feature is still missing implies to me that it takes some time and sophistication to develop it, hence the maintainers weren't able to include it so far. Regardless of that, what are the necessary educational resources to begin to understand how datatable works under the hood?

samukweku commented 3 years ago

@Peter-Pasta I am still finding my way around the source code. The core maintainers can explain better

st-pasha commented 3 years ago

We have a tutorial on creating a new datatable function: https://datatable.readthedocs.io/en/latest/develop/create-fexpr.html

Now, since `in` is an operator and not a regular function, the process will be slightly more complicated: you'd need to fill the tp_as_sequence slot and implement the sq_contains method.
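For orientation, sq_contains is the C-level slot behind Python's `__contains__`. One detail worth keeping in mind is that Python converts the result of `in` to a bool, and that `x in y` consults the right-hand operand. The sketch below illustrates both points in pure Python; the `Expr` class is a hypothetical stand-in for a lazy column expression, not datatable's actual implementation.

```python
class Expr:
    """Hypothetical lazy expression; only records what was asked of it."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        # Comparison operators may return any object, so laziness survives.
        return Expr(f"({self.text} == {other!r})")

    def __contains__(self, item):
        # The return value is an Expr, but `in` will coerce it to bool.
        return Expr(f"({item!r} in {self.text})")


col = Expr("A")
eq_result = col == 1        # stays an Expr
in_result = 1 in col        # calls col.__contains__, then converts to bool
in_list = col in [1, 2]     # calls list.__contains__, not Expr.__contains__

print(type(eq_result).__name__)
print(type(in_result).__name__)
print(type(in_list).__name__)
```

This is one reason a regular function may be a more practical interface than the operator for building a filter expression.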

As for the "core" of the function, there are two existing examples that are quite similar: the replace() function, which compares each value with a list (or map) of values, and the join() function, which compares each value with a sorted column via binary search.
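The two strategies mentioned above can be sketched in pure Python. This is only an illustration of the ideas (hash lookup versus binary search on sorted values), not how replace() or join() are implemented in datatable; the function names are hypothetical.

```python
from bisect import bisect_left


def isin_hashed(column, values):
    """Membership mask using a hash set of the candidate values."""
    lookup = set(values)
    return [x in lookup for x in column]


def isin_sorted(column, sorted_values):
    """Membership mask using binary search; sorted_values must be sorted."""
    mask = []
    for x in column:
        i = bisect_left(sorted_values, x)
        mask.append(i < len(sorted_values) and sorted_values[i] == x)
    return mask


column = [1, 5, 3, 9]
print(isin_hashed(column, [3, 5, 7]))
print(isin_sorted(column, [3, 5, 7]))
```

Both produce the same mask. The hashed variant needs hashable values, while the sorted variant needs the candidate values to be ordered beforehand.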

Overall, on a difficulty scale from 1 (easy) to 5 (hard), I would rate this task as 2 or 3.

samukweku commented 3 years ago

I think it might be easier to write a function instead of an operator for `in`, maybe dt.in. I would like to give it a shot.
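One practical wrinkle with the name: `in` is a reserved keyword in Python, so an attribute literally spelled `dt.in` could not be written in source code. The small check below shows how candidate names can be vetted; the alternatives `isin` and `in_` are only illustrative suggestions, not names datatable has adopted.

```python
import keyword


def is_usable_attribute(name):
    """True if `name` can be written as obj.name in Python source."""
    return name.isidentifier() and not keyword.iskeyword(name)


for candidate in ("in", "isin", "in_"):
    print(candidate, is_usable_attribute(candidate))
```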

samukweku commented 3 years ago

I also need some guidance, @st-pasha @oleksiyskononenko: when building datatable in editable mode, I don't have an easy-install.pth in my site-packages folder, only an easy-install.py file. As such, I can't run this command: echo "`pwd`/src" >> ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth

samukweku commented 3 years ago

@oleksiyskononenko @st-pasha Any ideas on how I can fix the issue above?

st-pasha commented 3 years ago

@samukweku Sorry, I was on vacation last week and didn't see your message.

So the main challenge with "editable mode" installations in python is that there is no official PEP standard for this, which makes it hard to provide reliable instructions here. You can try one of the following approaches:

  1. Create the easy-install.pth file using the command above. It should work as-is, or if you have an older version of shell, try echo "`pwd`/src" >> `ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth`.
  2. Create a virtual environment specifically for datatable development, using the virtualenv command.

samukweku commented 3 years ago

@st-pasha, I'm still having issues with the installation. I successfully got it installed as editable; however, the datatable version is 0.11.1. I uninstalled it (pip uninstall datatable), thinking that would take care of the problem (as suggested here); however, I get the error message below when I try to run make test:

make test
python -m pytest -ra --maxfail=10 -Werror tests
ImportError while loading conftest '/home/sam/github/datatable/tests/conftest.py'.
tests/__init__.py:14: in <module>
    from datatable.lib import core
E   ModuleNotFoundError: No module named 'datatable'
make: *** [Makefile:59: test] Error 4

Could you kindly suggest how I can fix this?

st-pasha commented 3 years ago

On my computer I have the following configuration: the repository is checked out into

$ pwd
/Users/pasha/github/datatable

The content of the "easy-install.pth" is

$ ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth
/Users/pasha/py36/lib/python3.6/site-packages/easy-install.pth
$ cat `ls ${VIRTUAL_ENV}/lib/python*/site-packages/easy-install.pth`
/Users/pasha/github/datatable/src

And I can verify that this works by checking

$ python
Python 3.6.6 (v3.6.6:4cf1f54eb7, Jun 26 2018, 19:50:54) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import datatable
>>> datatable.__file__
'/Users/pasha/github/datatable/src/datatable/__init__.py'

The import command may fail like this if the core wasn't compiled yet with either make debug or make build:

>>> import datatable
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pasha/github/datatable/src/datatable/__init__.py", line 23, in <module>
    from .frame import Frame
  File "/Users/pasha/github/datatable/src/datatable/frame.py", line 23, in <module>
    from datatable.lib._datatable import Frame
  File "/Users/pasha/github/datatable/src/datatable/lib/__init__.py", line 31, in <module>
    from . import _datatable as core
ImportError: cannot import name '_datatable'

However, if the import says that datatable was not found, that would indicate that the installation in editable mode failed somehow.
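The first of these two failure modes can also be checked without importing the package at all, using the standard library's importlib.util.find_spec. The helper name `locate_package` is hypothetical, and the sketch is demonstrated with a standard-library package so that it runs anywhere; with a working editable install, passing "datatable" would be expected to return a path inside the repository's src directory.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Python cannot find the package on sys.path at all.
        return None
    return spec.origin


print(locate_package("json"))
print(locate_package("no_such_package_xyz"))
```

A result of None corresponds to the "not found" case, while a valid path combined with a failing import points toward the uncompiled core.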

samukweku commented 3 years ago

@st-pasha thanks; I found the error on my end and fixed it: the echo part wasn't copying the right thing to my easy-install.pth file. All good now.
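For anyone hitting the same problem, the echo step can also be done from Python, which sidesteps shell quoting and glob expansion. This is a sketch under the assumption that a .pth file in site-packages is the mechanism in use; the helper name `add_pth_entry` is hypothetical.

```python
import os


def add_pth_entry(site_packages, src_dir, pth_name="easy-install.pth"):
    """Append src_dir to a .pth file in site_packages unless already listed."""
    src_dir = os.path.abspath(src_dir)
    pth_path = os.path.join(site_packages, pth_name)

    content = ""
    if os.path.exists(pth_path):
        with open(pth_path) as fh:
            content = fh.read()

    entries = [line.strip() for line in content.splitlines()]
    if src_dir not in entries:
        with open(pth_path, "a") as fh:
            # Keep one path per line even if the file lacked a final newline.
            if content and not content.endswith("\n"):
                fh.write("\n")
            fh.write(src_dir + "\n")
    return pth_path
```

From the repository root, inside the virtual environment, it could be called as add_pth_entry(sysconfig.get_paths()["purelib"], "src") after importing sysconfig.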

Another question: if changes are made to the C++ code, make build is required. How do I test code changes in the Python part? Say, for instance, I want f.string_column.len() to return 2. It's a silly example, but I hope you get my point. This does not involve any C++, so how do I do that?

st-pasha commented 3 years ago

If you make changes to C++, you need to run make build (or make debug) and then restart the Python console (or reload the kernel in Jupyter). If you make changes to Python only, then you just need to restart the Python console.