Split up parse() - Githubissues

oasis-open / cti-python-stix2

OASIS TC Open Repository: Python APIs for STIX 2

https://stix2.readthedocs.io/

BSD 3-Clause "New" or "Revised" License

373 stars 120 forks source link

Split up parse() #135

Closed clenk closed 6 years ago

clenk commented 6 years ago

parse() does two things:

ensures input is a dictionary (converting it if not, potentially calling json.load() or json.loads())
converts a dictionary into a python-stix2 object

We should split up these two parts. That way we can just call the second part to convert a dict to a STIX2 Object or Bundle, without having to potentially deal with the overhead (or check for the case of) a string or file object.

elegantmoose commented 6 years ago

So as the first part of parse() just calls get_dict() , we just want to have parse() do the version mapping and stix dict -> stix obj. Correct?

And then make all existing parse() calls be changed to: do get_dict() if needed, then always call parse()

clenk commented 6 years ago

Thinking about this again, this trades usability for flexibility. Whereas before a user could just throw anything at parse(), after this they'd have to remember that if it's not a dictionary, they have to call get_dict() first.

What if you pull out the version mapping / dict -> obj stuff to a new function, and then have parse() call get_dict() and the new function? Then users can still call parse() and throw whatever at it, and in the library where the get_dict part isn't needed, we can just call the new function.

elegantmoose commented 6 years ago

Okay, but I only see parse() being used in TAXIICollection, Memory and FileSystem, so my question is where would this new sub-method be used within the library?

Also get_dict() has no overhead if the data is already a dict, it just returns.

@gtback , any insight?

gtback commented 6 years ago

I'm fine having an all-purpose "do whatever you can to get a STIX object out of this argument" function. I really just wanted to avoid unnecessary round-trips to and from dictionaries (particularly when you had to deal with allow_custom. I would prefer, as @clenk suggested, to have a more tailored "I know this is already a dict" function, and call that from the DataSource subclasses.

clenk commented 6 years ago

One of the places to use the new_parse_without_dict function is in TAXIICollectionSink.add() when stix_data is a dict representing a bundle: https://github.com/oasis-open/cti-python-stix2/blob/master/stix2/datastore/taxii.py#L78

(From #134)

elegantmoose commented 6 years ago

@gtback , where are the round-trips to/from dictionaries that are happening? I see this happening in TAXIICollectionSink.add() and FileSystemSink.add() . Looking at those instances of parse(), it seems you could trigger STIX string -> STIX dict -> STIX obj -> STIX string (if you pass a STIX string). Are those the roundtrips you dont want? If so, splitting up parse() doesnt change that, right? You still would have to call the full parse process.

Also, there is no conversion/overhead if you call parse() and pass a dict. parse() calls get_dict() which returns immediately if the data is already a dict. Correct? So https://github.com/oasis-open/cti-python-stix2/blob/master/stix2/datastore/taxii.py#L78 only triggers an extra if-statement which is negligible.

Sorry if I am belaboring the point, I just still dont think I understand the goal of the separate functions.

gtback commented 6 years ago

STIXObjectProperty.clean calls get_dict(), then checks some things on the dict before calling parse(). The overhead of calling get_dict() on something that's already a dict is minimal, yes (there is the overhead of a function call), but it reflects ambiguity/uncertainty about what a certain data type is at a certain point.

FileSystemSink.add relies on parse() to treat strings and dicts in the same elif. I would rather say "if it is a string: parse string into dict" then (separately) say "if it is now a dict, create python-stix2 class out of the dict". TAXIIDataSink.add does something similar, but uses different branches of logic for dicts and strs.

For user-facing functions (like parse()), that flexibility is OK (we can say in the docstrings "give me X, Y, or Z and I will give you A, or raise error E"), but for internal uses, I'd rather be deliberate than sloppy with data types.

I'm just looking for opportunities to refactor so each function does exactly one thing.

elegantmoose commented 6 years ago

Got it. I just wanted to clarify the specific purpose. As I didn't see noticeable performance improvements or current round-trips from being avoided.

gtback commented 6 years ago

This should be fixed by #153