piisa / pii-data

Base data structures for PII management
Apache License 2.0

Example implementation #3

Open omri374 opened 2 years ago

omri374 commented 2 years ago

@paulovn I've started thinking about implementing an API using the pii-data objects. What approach would you suggest? This is what I have in mind:

  1. Read a set of files (using RawReader)
  2. Pass to an identification API
  3. Translate results into a list of PiiEntity and PiiCollection
  4. Return a JSON document containing all results

Is this aligned with the intended use? Can I leverage other parts of the framework?
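The four steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: `DetectedEntity`, `detect` and `process` are hypothetical names, the entity class is a stand-in and not the pii-data `PiiEntity`/`PiiCollection` API, file reading is replaced by an in-memory dict, and a toy email regex plays the role of the identification API.

```python
import json
import re
from dataclasses import asdict, dataclass
from typing import Dict, List


# Hypothetical stand-in for a detected entity; NOT the pii-data PiiEntity class.
@dataclass
class DetectedEntity:
    doc_id: str
    pii_type: str
    value: str
    start: int
    end: int


# Toy detector standing in for the real identification API (step 2).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def detect(doc_id: str, text: str) -> List[DetectedEntity]:
    """Step 3: translate raw matches into entity objects."""
    return [
        DetectedEntity(doc_id, "EMAIL_ADDRESS", m.group(), m.start(), m.end())
        for m in EMAIL_RE.finditer(text)
    ]


def process(documents: Dict[str, str]) -> str:
    """Steps 2-4: detect entities in every document and serialize them to JSON."""
    entities: List[DetectedEntity] = []
    for doc_id, text in documents.items():
        entities.extend(detect(doc_id, text))
    return json.dumps([asdict(e) for e in entities], indent=2)


# Step 1 is simulated here: document contents keyed by a made-up file name.
docs = {
    "a.txt": "Contact jane@example.com for details.",
    "b.txt": "No PII here.",
}
result = process(docs)
print(result)
```

In a real implementation the stand-in class would be replaced by the pii-data objects and the regex by the actual detector; the overall read, detect, translate, serialize shape would stay the same.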

paulovn commented 2 years ago

I think this is a good idea. Question: would the API (I assume a REST API) work at the level of an individual document or of a collection?

It would be great to make it work in some kind of streaming fashion (i.e. send chunks and receive entities). Of course, this would make it more complicated, so I would leave it for phase 2.
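To make the "send chunks and receive entities" idea concrete, here is a minimal sketch using a Python generator. Everything in it is an assumption for illustration: `stream_entities` is a hypothetical name, the phone-like regex is a toy detector, and entities are plain tuples rather than pii-data objects. It also ignores entities that straddle a chunk boundary, which a real streaming design would need to handle.

```python
import re
from typing import Iterable, Iterator, Tuple

# Toy pattern standing in for a real detector (assumption for illustration).
PHONE_RE = re.compile(r"\b\d{3}-\d{4}\b")


def stream_entities(chunks: Iterable[str]) -> Iterator[Tuple[str, int, int]]:
    """Yield (value, start, end) as each chunk arrives.

    Offsets are relative to the whole document, not to the chunk.
    """
    offset = 0
    for chunk in chunks:
        for m in PHONE_RE.finditer(chunk):
            yield m.group(), offset + m.start(), offset + m.end()
        offset += len(chunk)


chunks = ["Call 555-1234 now. ", "Or 555-9876."]
for entity in stream_entities(chunks):
    print(entity)
```

The caller can start consuming entities before the last chunk has been sent, which is the main attraction of the streaming variant.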

A couple more notes:

omri374 commented 2 years ago

Thanks! We can definitely look into the streaming option, but I'd like to start with something simple to see that it all fits together. For the API I was actually thinking of a Python API, but we can do both. The Python API would probably be the easiest, as it doesn't require starting a web server.

I'll wait for your changes and then introduce mine.

omri374 commented 2 years ago

In Presidio, we generalized semi-structured data (tabular, JSON) into Dict[str, Union[Any, Iterable[Any]]]. This covers column-based (key = column name, values = column values), row-based (key = row id, values = row values), key-value and nested key-value data. We can consider something similar here too.
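As an illustration of how one type can cover those four layouts, here is a small sketch. The alias `SemiStructured`, the helper `iter_values` and the sample data are made up for this example; this is not Presidio's implementation, only a picture of the shape described above.

```python
from typing import Any, Dict, Iterable, Iterator, Tuple, Union

# Hypothetical alias for the generalized type mentioned above.
SemiStructured = Dict[str, Union[Any, Iterable[Any]]]

# Column-based: key = column name, values = column values.
column_based: SemiStructured = {
    "name": ["Ann", "Bob"],
    "email": ["ann@example.com", "bob@example.com"],
}
# Row-based: key = row id, values = row values.
row_based: SemiStructured = {
    "row0": ["Ann", "ann@example.com"],
    "row1": ["Bob", "bob@example.com"],
}
# Plain key-value.
key_value: SemiStructured = {"name": "Ann", "email": "ann@example.com"}
# Nested key-value.
nested: SemiStructured = {"user": {"name": "Ann", "contact": {"email": "ann@example.com"}}}


def iter_values(data: SemiStructured, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Walk any of the layouts and yield (path, scalar value) pairs."""
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from iter_values(value, path)
        elif isinstance(value, (list, tuple)):
            for i, item in enumerate(value):
                yield f"{path}[{i}]", item
        else:
            yield path, value


for path, value in iter_values(nested):
    print(path, value)
```

A detector can then run over every yielded value and report the path alongside each entity, regardless of which layout the input used.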

paulovn commented 2 years ago

Sounds good! In my working version I was only considering row-based tables (pure enumerations, no specific row keys) for simplicity, but we can extend it later along the lines you suggest.

paulovn commented 2 years ago

Ok, my PR is now online: #4

Sorry for its size; it's quite large, since I've tried to accommodate the specification in it (table documents, context, etc.).

NB: I've also started a companion repo, pii-preprocess, to hold more specific code for the preprocessing stage (document reading, etc.).