scienxlab / redflag

Safety net for machine learning pipelines. Plays nice with sklearn and pandas.
https://scienxlab.org/redflag
Apache License 2.0

Check for duplicate records #67

Open kwinkunks opened 1 year ago

kwinkunks commented 1 year ago

Duplicate records can be a problem, so could we check for them? E.g. check for unique rows, or make a set? That could be expensive for a large dataset though. It should be easy enough to experiment with some random data (presumably the worst-case scenario).
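To make the two ideas concrete, here is a rough sketch of such an experiment, assuming the data is a purely numeric 2D NumPy array with no NaNs. The function names `count_duplicates_unique` and `count_duplicates_set` are made up for illustration and are not part of redflag.

```python
import time

import numpy as np


def count_duplicates_unique(X):
    """Count rows that repeat another row, via np.unique on rows."""
    X = np.asarray(X)
    return X.shape[0] - np.unique(X, axis=0).shape[0]


def count_duplicates_set(X):
    """Count rows that repeat another row, via a set of row tuples."""
    X = np.asarray(X)
    return X.shape[0] - len({tuple(row) for row in X})


# Random floats: rows are almost surely all distinct, so nothing
# collapses early. Then append 10 known duplicates to check the counts.
rng = np.random.default_rng(42)
X = rng.normal(size=(100_000, 5))
X = np.vstack([X, X[:10]])

for func in (count_duplicates_unique, count_duplicates_set):
    start = time.perf_counter()
    n = func(X)
    elapsed = time.perf_counter() - start
    print(f"{func.__name__}: {n} duplicates in {elapsed:.3f} s")
```

Both approaches only catch exact matches; near-duplicates (e.g. values differing by floating point noise) would need rounding or a tolerance first, which is a separate question.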

bhoomikaagrawal16 commented 1 year ago

Hello, I would like to work on this. Can you elaborate more on what is expected?

kwinkunks commented 1 year ago

@bhoomikaagrawal16 hello, and thanks for thinking of contributing!

I guess there are at least a couple of scenarios:

There are 3 places I put things:

So a good place to start might be to create a module with an experimental 'duplicate detecting' function. As a rule of thumb, it needs to run reasonably fast on at least 100k records.

Write simple docstrings and doctests please (see the other modules).
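As a starting point only, a function like that might look something like the sketch below. The name `is_duplicate`, its signature and the docstring layout are assumptions for illustration, not existing redflag API, and it only handles exact matches in a numeric 2D array.

```python
import numpy as np


def is_duplicate(X):
    """
    Flag rows that exactly repeat an earlier row.

    Hypothetical function for illustration; not part of redflag.

    Args:
        X (array-like): 2D array of shape (n_samples, n_features).

    Returns:
        ndarray: Boolean array of shape (n_samples,), True where the
            row repeats an earlier row.

    Examples:
        >>> X = [[1, 2], [3, 4], [1, 2], [5, 6], [3, 4]]
        >>> is_duplicate(X).tolist()
        [False, False, True, False, True]
    """
    X = np.asarray(X)
    # return_index gives the position of the first occurrence of each
    # unique row; every other row is a repeat of one of those.
    _, first = np.unique(X, axis=0, return_index=True)
    mask = np.ones(X.shape[0], dtype=bool)
    mask[first] = False
    return mask
```

If this were saved as a module, the example in the docstring could be checked with `python -m doctest` on that file. `np.unique` sorts the rows, so the cost grows roughly as n log n in the number of records.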

Does this help? Let me know if you need more.