Open kwinkunks opened 1 year ago
Hello, I would like to work on this. Can you elaborate more on what is expected?
@bhoomikaagrawal16 hello, and thanks for thinking of contributing!
I guess there's at least a couple of scenarios:
There are 3 places I put things:

- `duplicates.py` -- I would start here
- `sklearn` transformers, both supervised and unsupervised, in `sklearn.py` (usually trying to use functions from whatever modules)
- `pandas` accessors, in `pandas.py` (usually trying to use functions from whatever modules)

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. It needs to run reasonably fast on at least 100k records, as a rule of thumb.
Write simple docstrings and doctests please (see the other modules).
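For what it's worth, a first pass at such a function might look something like the sketch below. The name `find_duplicates` and its signature are hypothetical, not existing project code, and it relies on plain Python hashing rather than anything vectorized, so treat it as a starting point for experiments rather than the intended design.

```python
def find_duplicates(rows):
    """
    Return the indices of rows that repeat an earlier row.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.

    Args:
        rows (iterable): Records, each an iterable of hashable values.

    Returns:
        list: Index of every row that matches a row seen before it.

    Examples:
        >>> find_duplicates([[1, 2], [3, 4], [1, 2], [1, 2]])
        [2, 3]
        >>> find_duplicates([[1, 2], [3, 4]])
        []
    """
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        key = tuple(row)  # Lists are not hashable, tuples are.
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates
```

The examples in the docstring double as doctests, so they can be checked with `python -m doctest` on the module file.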
Does this help? Let me know if you need more.
Duplicate records can be a problem, could check for this? E.g. check unique rows, or make a set? Could be nasty for a large dataset though? Easy enough to experiment with some random data (presumably worst case scenario)
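One quick way to run that experiment, as a minimal sketch: build 100k random rows, make a set, and time it. The row and column counts, the seed, and the three planted duplicates are all assumptions chosen for illustration; the planted rows are only there so the count can be checked.

```python
import random
import time

random.seed(42)
n_rows, n_cols = 100_000, 5

# Random floats almost never collide, so nearly every row is unique
# and the set has to hold close to n_rows entries.
rows = [tuple(random.random() for _ in range(n_cols)) for _ in range(n_rows)]

# Plant three known duplicates so there is something to find.
rows[10] = rows[0]
rows[20] = rows[0]
rows[-1] = rows[5]

start = time.perf_counter()
n_unique = len(set(rows))
elapsed = time.perf_counter() - start

n_duplicates = len(rows) - n_unique
print(f"{n_duplicates} duplicate rows found in {elapsed:.3f} s")
```

This only counts duplicates; if the data is already in a `pandas` DataFrame, `DataFrame.duplicated()` returns a per-row boolean mask and would be worth timing alongside it.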