Closed clenk closed 6 years ago
So as the first part of parse() just calls get_dict() , we just want to have parse() do the version mapping and stix dict -> stix obj. Correct?
And then make all existing parse() calls be changed to: do get_dict() if needed, then always call parse()
Thinking about this again, this trades usability for flexibility. Whereas before a user could just throw anything at parse()
, after this they'd have to remember that if it's not a dictionary, they have to call get_dict()
first.
What if you pull out the version mapping / dict -> obj stuff to a new function, and then have parse()
call get_dict()
and the new function? Then users can still call parse()
and throw whatever at it, and in the library where the get_dict
part isn't needed, we can just call the new function.
Okay, but I only see parse() being used in TAXIICollection, Memory and FileSystem, so my question is where would this new sub-method be used within the library?
Also get_dict() has no overhead if the data is already a dict, it just returns.
@gtback , any insight?
I'm fine having an all-purpose "do whatever you can to get a STIX object out of this argument" function. I really just wanted to avoid unnecessary round-trips to and from dictionaries (particularly when you had to deal with allow_custom
. I would prefer, as @clenk suggested, to have a more tailored "I know this is already a dict" function, and call that from the DataSource subclasses.
One of the places to use the new_parse_without_dict function is in TAXIICollectionSink.add()
when stix_data
is a dict representing a bundle: https://github.com/oasis-open/cti-python-stix2/blob/master/stix2/datastore/taxii.py#L78
(From #134)
@gtback , where are the round-trips to/from dictionaries that are happening? I see this happening in TAXIICollectionSink.add() and FileSystemSink.add() . Looking at those instances of parse(), it seems you could trigger STIX string -> STIX dict -> STIX obj -> STIX string (if you pass a STIX string). Are those the roundtrips you dont want? If so, splitting up parse() doesnt change that, right? You still would have to call the full parse process.
Also, there is no conversion/overhead if you call parse() and pass a dict. parse() calls get_dict() which returns immediately if the data is already a dict. Correct? So https://github.com/oasis-open/cti-python-stix2/blob/master/stix2/datastore/taxii.py#L78 only triggers an extra if-statement which is negligible.
Sorry if I am belaboring the point, I just still dont think I understand the goal of the separate functions.
STIXObjectProperty.clean
calls get_dict()
, then checks some things on the dict before calling parse()
. The overhead of calling get_dict()
on something that's already a dict is minimal, yes (there is the overhead of a function call), but it reflects ambiguity/uncertainty about what a certain data type is at a certain point.
FileSystemSink.add
relies on parse() to treat strings and dicts in the same elif
. I would rather say "if it is a string: parse string into dict" then (separately) say "if it is now a dict, create python-stix2 class out of the dict". TAXIIDataSink.add
does something similar, but uses different branches of logic for dicts and strs.
For user-facing functions (like parse()
), that flexibility is OK (we can say in the docstrings "give me X, Y, or Z and I will give you A, or raise error E"), but for internal uses, I'd rather be deliberate than sloppy with data types.
I'm just looking for opportunities to refactor so each function does exactly one thing.
Got it. I just wanted to clarify the specific purpose. As I didn't see noticeable performance improvements or current round-trips from being avoided.
This should be fixed by #153
parse()
does two things:json.load()
orjson.loads()
)We should split up these two parts. That way we can just call the second part to convert a dict to a STIX2 Object or Bundle, without having to potentially deal with the overhead (or check for the case of) a string or file object.