scopatz / pymoab

Python Bindings to MOAB
BSD 2-Clause "Simplified" License

Opaque Tag Data #8

Closed pshriwise closed 8 years ago

pshriwise commented 8 years ago

This has to do with how we want users to pass opaque tag data into pymoab.

Intuitively, users will want to pass tag data in like so:

>>> tag_data = np.array(("Value1","Value2",))
>>> mb.tag_set_data(tag_handle, entity_handles, tag_data)

and when we go to set the tag data the following will be done (in core.pyx):

 err = self.inst.tag_set_data(tag.inst, <unsigned long*> arr.data, len(entity_handles), <const void*> data.data)

The problem here is that the data type of this numpy array will be 'S6', but the length of the tag in MOAB proper could be either less or more than that value. At this point tag_data.data looks like "Value1Value2" instead of "Value1\x00\x00\x00Value2\x00\x00\x00", which is what MOAB expects if the tag length were 9.

If it's higher than 6 (9, for example), then the data will be set as such:

Entity 1 Tag Value: Value1Val
Entity 2 Tag Value: ue2(junk)

If it is lower than 6 (3, for example), then the data will be set as such:

Entity 1 Tag Value: Val
Entity 2 Tag Value: ue1

Both of these are incorrect.

This can be remedied by adjusting the array entry size with

data = np.asarray(data,dtype='S9')

to expand the entry length or

data = np.asarray(data,dtype='S3')

to contract each entry (though this has obvious consequences as well).
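For example, looking at the raw buffer before and after the adjustment (byte strings used here so the dtype comes out as 'S6' on Python 3 as well):

>>> import numpy as np
>>> data = np.array((b"Value1", b"Value2"))
>>> data.tobytes()
b'Value1Value2'
>>> np.asarray(data, dtype='S9').tobytes()  # null-padded to a tag length of 9
b'Value1\x00\x00\x00Value2\x00\x00\x00'
>>> np.asarray(data, dtype='S3').tobytes()  # silently truncated
b'ValVal'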

Options:

1) This can be handled under the hood in pymoab by detecting the tag length from the instance and adjusting the array dtype with numpy.asarray, for opaque tag types only.

2) We can expect users to set up the data correctly from the start, as such:

data = np.array(["Value1","Value2",],dtype='S9')

which is what I'd like to do. The way we pass back and forth from the MOAB instance in pymoab right now is very clean and I don't want to muck it up by trying to handle this case. The downside is that we're asking users to know or retrieve the tag length prior to setting up data, but this is simply how we've done things in MOAB for a long time. Additionally, I don't see another way to do this for bit tags, and users will need to know how this works in that case anyway.

I hope this was clear enough! Please let me know what you think @vijay & @scopatz and I'll proceed as needed.

pshriwise commented 8 years ago

Should have tagged you guys; looking for some input from @scopatz, @vijaysm.

scopatz commented 8 years ago

Hi @Pshriwise - I think that option (1) is the correct thing to do, noting that if done properly it doesn't discount (2). To me, this is what np.asarray() (see the numpy docs) is for. If the data type matches that of the array argument, the function doesn't do anything and just returns a reference to the original array. If the data types don't match, then it will do a copy with the new data type.

Also, for data like this in Python, I find it more convenient for the user to be able to pass in a list of variable length strings (or an array with the object data type). The np.asarray() function will then convert this into the appropriate dtype.
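For instance (illustrative session; note the no-copy behavior when the dtype already matches):

>>> import numpy as np
>>> np.asarray([b"a", b"longer", b"mid"], dtype='S9')
array([b'a', b'longer', b'mid'], dtype='|S9')
>>> data = np.array([b"Value1", b"Value2"], dtype='S9')
>>> np.asarray(data, dtype='S9') is data  # dtype matches, no copy
True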

Does this sound reasonable to you?

pshriwise commented 8 years ago

@scopatz I'm still a little conflicted about this. I can't tell whether or not I'm just being vindictive in wanting to force users to track tag length like we have to in C++ :), but tag length is an important part of how the underlying database is organized. As I write this, though, I really don't see any harm in using np.asarray() and tag_get_length to verify that the strings being passed each contain the correct number of bytes, so long as we test it appropriately. I'll adjust my upcoming pull requests to reflect that behavior.

Moving forward with option 1, the tag data will come back with the correct data type, which may be confusing to those adding string data that doesn't occupy the full tag length. I don't foresee this being a big problem unless a large number of elements are being tagged with byte data and users want to truncate the returned numpy array. At any rate, I'd like to make that clear to them somewhere in our future/theoretical documentation. :)

Addendum: For clarity, are you thinking of this just in the context of opaque tags, or are you thinking that np.asarray() would be used in all cases to ensure we're using the correct data type? My fear in the latter case is that it would obscure the nature of the database too much. Users wouldn't track tag types correctly, which would cause problems when visualizing data, searching for existing tags, etc.

scopatz commented 8 years ago

Yeah, so this does come down to reconciling philosophies between C/C++ and Python. In C, you know exactly what you have (except that types are ill-defined) and you can cut your own head off if you want to. In C++ the rules are very strict, but you might not know what they are. In Python, everything is supposed to just work, while giving you access to the details in case you care.

So if it is a couple of lines to make an API more generic and easier to use, then the Pythonic answer is to do that.

Now in the opaque case here, I think it is so prone to going wrong that the size lookup and np.asarray() are pretty much needed. I think there is a strong case to be made for doing this in other places (if something is stored as float64 and you happen to pass it float32, should that legitimately be an error?), but it is probably not major enough to be necessary. That is more a matter of style.
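In pseudocode, the setter would just do something like this for opaque tags (assuming some length lookup on the instance; the method name here is a guess):

import numpy as np

def _prep_opaque(mb, tag_handle, data):
    # look up the tag's length from the instance, then coerce the array so
    # each entry occupies exactly that many bytes (null-padded or truncated)
    length = mb.tag_get_length(tag_handle)  # assumed lookup; name may differ
    return np.asarray(data, dtype='S%d' % length)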

Pontificating a bit here: in my experience, Python APIs that try to enforce C-level type strictness end up bulky and unwieldy, and no one wants to use them. Case in point: PyTAPS, though I have seen others do the same thing. On the other hand, libraries that allow a little more fluidity in types (numpy, pandas, pytables, h5py) have been much more successful.

tl;dr It is probably the right thing to do for opaque types and can go either way for most other types.

pshriwise commented 8 years ago

I suppose you're right; it really does come down to differences in language philosophies, and based on your prior experience with interfaces like these, it sounds like we should definitely go forward with a size lookup and np.asarray(). If people won't use/enjoy the interface, then our effort will be wasted.

For other data types, I'd rather not do this unless necessary as even small changes in conversion between floating point precisions can cause problems in underlying algorithms (DAGMC's particle tracking for example).

One other tag type we definitely should do this for, however, is bit tags. It is the same problem if I'm thinking about it correctly. Is that right @vijaysm?

vijaysm commented 8 years ago

I agree that option 1 is the way to go.

Now in the opaque case here, I think it is so prone to going wrong that the size lookup and np.asarray() are pretty much needed. I think there is a strong case to be made for doing this in other places (if something is stored as float64 and you happen to pass it float32, should that legitimately be an error?), but it is probably not major enough to be necessary. That is more a matter of style.

I think that this should be a hard check, kind of like an assert on all of the setters. You definitely don't want to create a tag in MOAB with one type and then set data that is not allocated correctly. This case should die.

@Pshriwise, I think we should be doing this check for all tag types, no? I don't see why not.

I'm on travel this week and so my replies will be slow!


pshriwise commented 8 years ago

@scopatz @vijaysm I think we're starting to get into a different (albeit important) area here.

1 is about expanding opaque (and bit?) tag data handed to tag_set_data via numpy array s.t. it matches the tag length. 2 has to do with verifying that the data type being passed is a valid data type for the tag in question.

With regard to 1: should the case in which a string in the np array is too long for the tag length fail? Or should we simply truncate it and give the user a warning notification?

2: We should certainly be checking that the incoming data array is the correct type for the tag in all cases. I'll make that an addition in the next PR I submit, which expands the interface. A simple assertion on the data types should work, right?

scopatz commented 8 years ago

Should the case in which a string in the np array is too long for a tag length fail? Or should we simply truncate it and give the user a warning notification?

I would truncate silently, but a warning is a good idea.

A simple assertion on the data types should work, right?

Yeah. numpy.testing might have some special assertion functions, or I seem to recall that == works on dtypes as well.
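e.g., something like:

>>> import numpy as np
>>> data = np.array([1.0, 2.0])
>>> data.dtype == np.float64
True
>>> assert data.dtype == np.float64  # passes silently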

pshriwise commented 8 years ago

It seems we will likely run into this problem with integer tags as well. On current systems, np arrays will default to int64 while the C++ side will be expecting tags 4 bytes long.

From numpy's dtype page

There are 5 basic numerical types representing booleans (bool), integers (int), unsigned integers (uint), floating point (float) and complex. Those with numbers in their name indicate the bitsize of the type (i.e. how many bits are needed to represent a single value in memory). Some types, such as int and intp, have differing bitsizes, dependent on the platforms (e.g. 32-bit vs. 64-bit machines). This should be taken into account when interfacing with low-level code (such as C or Fortran) where the raw memory is addressed.
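A quick check on a typical 64-bit Linux machine shows the mismatch (np.intc is numpy's alias for a C int):

>>> import numpy as np
>>> np.array([1, 2, 3]).dtype  # platform default integer
dtype('int64')
>>> np.dtype(np.intc).itemsize  # size of the C int backing an integer tag
4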

scopatz commented 8 years ago

Yeah, Python (and hence numpy) ints are default 64 bit.

vijaysm commented 8 years ago

I think we should be able to decipher the size of the numpy array and evaluate compatibility before set operations. Once we have unit tests, these can be checked for various configuration combinations of MOAB. I'm sure there are cases that will come up then.

On a separate note, we have been developing something called iMOAB, which is a solver-focused C/Fortran interface [1] for querying MOAB. Unsure if this is relevant to pymoab right now, but if there are certain interfaces that need to be modified in either component, it would be good to get discussions started.

[1] http://ftp.mcs.anl.gov/pub/fathom/imoab-docs/iMOAB_8h.html


pshriwise commented 8 years ago

@vijaysm yes, I think you're right. We should really just check all types for compatibility before setting data. I'll include this as part of the next PR.

As far as iMOAB goes, this looks like an offshoot of iMesh? Is that right? I'm not sure which would be better to build pymoab on top of. @scopatz might have more to say about that. What I like about this interface so far is that we're providing access to the pure MOAB interface which has been around for a long time with little change.

vijaysm commented 8 years ago

@Pshriwise @scopatz Yes, it has inspirations from iMesh but, hopefully, without all its issues. We let the calling user code allocate all data, and the interface internally maintains essential entity lists in ranges for collective queries. Another differentiation is that iMOAB is more solver focused, and we can certainly add helper routines for visualization, I/O, etc. I'm not suggesting pyMOAB needs to use this, but if it is relevant, that is great.


scopatz commented 8 years ago

I think history has shown that building Python tools off of the interface to another language does not work as well as building off the native language. I have also experienced this problem with things like Clang and its C wrapper. Let's at least try going from C++ directly. Not that we shouldn't still pull inspiration from iMOAB, but it might not make sense to be beholden to it.

vijaysm commented 8 years ago

@scopatz Agreed that there is no reason to have another layer of indirection when you can directly call the native C++ interfaces. iMOAB was motivated primarily by the needs of physics solvers, and there is no need to constrain pyMOAB if the interface as it stands does not support all needs.


pshriwise commented 8 years ago

Here's what I'm thinking for implementing checks on these different types as they go through pymoab to be set as data (a rough combined sketch follows the type descriptions).

MB_TYPE_OPAQUE

This one is pretty straightforward, I think: anything that is character based (dtype.char == 'S') is accepted. If the values are too short, then we'll allocate extra space for them as needed in np.asarray. If they are too long, then we'll truncate them to the correct size with the same method. We should return a warning to the user in that case.

MB_TYPE_INTEGER

This one is a little more complicated. Anything that is int8 can be expanded without a problem. Values which are too large, however, and actually take advantage of the int64 space, will be truncated to int32 with a completely incorrect value. We should return a warning to the user in this case.
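For example, a value that actually needs the int64 space wraps around on the narrowing cast:

>>> import numpy as np
>>> np.array([2**32 + 5], dtype=np.int64).astype(np.int32)
array([5], dtype=int32)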

MB_TYPE_DOUBLE

The only way to guarantee that we're getting the correct data here is to ensure that these are of type float64; floating point conversions aren't nearly as clean as integer conversions, which makes this one pretty straightforward. It should be rare that users are trying to pass any other kind of float data in, as float64 is the default type for numpy.

MB_TYPE_BIT

Same as the opaque tag, but we should allow bool arrays as well, since numpy also stores boolean values as bytes. This might be a convenient way for users to pass in bit data. We just need to trust that they know what they're doing here.

MB_TYPE_HANDLE

@vijaysm might need to chime in on this one. As Vijay mentioned before, EntityHandles are constructed specially in MOAB s.t. their hex representation can give direct information about what type they are as well as their ID. As a result, it seems the only thing we can really accept here are uint64 types to match our EntityHandle datatype. I think this is alright given that users shouldn't be constructing these EntityHandles themselves, but rather passing them on as tag values after getting them from our interface.
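Roughly, something like this (untested; the constants and names are just illustrative, not the final pymoab API):

import warnings
import numpy as np

# illustrative stand-ins for MOAB's tag type enum
MB_TYPE_OPAQUE, MB_TYPE_INTEGER, MB_TYPE_DOUBLE, MB_TYPE_BIT, MB_TYPE_HANDLE = range(5)

def validate_tag_data(data, tag_type, tag_length):
    """Coerce/verify a numpy array before handing its buffer to tag_set_data."""
    data = np.asarray(data)
    if tag_type == MB_TYPE_OPAQUE:
        # pad short entries with nulls; truncate long ones with a warning
        if data.dtype.char != 'S':
            raise TypeError("opaque tags take byte-string data")
        if data.dtype.itemsize > tag_length:
            warnings.warn("string data truncated to the tag length")
        return np.asarray(data, dtype='S%d' % tag_length)
    if tag_type == MB_TYPE_INTEGER:
        out = data.astype(np.int32)
        if not np.array_equal(out, data):  # values didn't survive the narrowing
            warnings.warn("integer tag data does not fit in 32 bits")
        return out
    if tag_type == MB_TYPE_DOUBLE:
        # no silent precision changes; require float64 outright
        if data.dtype != np.float64:
            raise TypeError("double tags require float64 data")
        return data
    if tag_type == MB_TYPE_BIT:
        # numpy stores booleans one byte per value, so bool arrays map cleanly
        if data.dtype not in (np.bool_, np.uint8):
            raise TypeError("bit tags take bool or uint8 data")
        return data.astype(np.uint8)
    if tag_type == MB_TYPE_HANDLE:
        if data.dtype != np.uint64:
            raise TypeError("handle tags require uint64 data")
        return data
    raise ValueError("unknown tag type")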

vijaysm commented 8 years ago

@Pshriwise I think all of your assumptions above are valid. I agree with those.

For MB_TYPE_HANDLE, these should be static in the sense that the user will never have to modify or create them themselves, if we design our interfaces right. So uint64 would do fine, as long as we do not compile MOAB with the --enable-32bit-handles option.

pshriwise commented 8 years ago

Agreed about the 'static' data type @vijaysm.

As far as the 32bit handles go, how might we check the MOAB install for that build option?

pshriwise commented 8 years ago

I'm going to close this (for now). I think we're at a pretty good place with this.

vijaysm commented 8 years ago

As far as the 32bit handles go, how might we check the MOAB install for that build option?

The only options here are to look at how MOAB was configured or to inspect sizeof(EntityHandle) to make a decision. This could be a one-time check either way.
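Something like this on the pymoab side would do (assuming the Cython layer exposes the compiled sizeof(EntityHandle) as a module constant; the name here is just illustrative):

import numpy as np
# hypothetical constant, set once in the Cython layer from sizeof(moab::EntityHandle)
from pymoab.core import _ENTITY_HANDLE_BYTES

def handle_dtype():
    # pick the numpy dtype matching the MOAB build; a one-time check
    if _ENTITY_HANDLE_BYTES == 8:
        return np.uint64
    if _ENTITY_HANDLE_BYTES == 4:  # MOAB configured with --enable-32bit-handles
        return np.uint32
    raise RuntimeError("unexpected EntityHandle size: %d" % _ENTITY_HANDLE_BYTES)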
