mspass-team / mspass

Massive Parallel Analysis System for Seismologists
https://mspass.org
BSD 3-Clause "New" or "Revised" License

integer size gotcha in MongoDB #40

Open pavlis opened 4 years ago

pavlis commented 4 years ago

In writing the import function for seispp data, I decided to build a dir - dfile - foff set of attributes to allow a direct read of raw data from MongoDB. The great news is this works and is super fast. To be specific, I wrote a (more complex than originally expected) python function to load a pair of files my export_to_mspass program produced. That is, it wrote a .yaml file of the metadata and dumped the sample data with simple calls to fwrite. The python function to then put the yaml file in MongoDB was pretty easy. To implement an ability to read the data without resorting to pickle (a nut I haven't cracked yet), I had the function save the file offset (foff) values returned by python's tell function. I then wrote the dir, dfile, and foff values into the python dictionary created from the yaml data like this:

    pyd['dir']=dir
    pyd['dfile']=dfile
    ns=pyd['npts']
    ns3c=3*ns
    foff=fh.tell()
    pyd['foff']=foff
    wfid=collection.insert_one(pyd).inserted_id
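For context, the offset bookkeeping described above can be sketched with stdlib pieces only. This is an illustration, not the actual MsPASS code: the variable names are invented, and a plain list stands in for the MongoDB collection (in the real code `fake_collection.append(pyd)` would be `collection.insert_one(pyd)`).

```python
import array
import tempfile

# Stand-in for a MongoDB collection:  just a list of dicts.
fake_collection = []

with tempfile.TemporaryFile() as fh:
    for npts in (100, 250):            # two fake 3-component waveforms
        pyd = {'npts': npts}           # metadata parsed from the yaml file
        ns3c = 3 * npts                # 3 components per sample
        pyd['foff'] = fh.tell()        # byte offset where this trace starts
        array.array('d', [0.0] * ns3c).tofile(fh)   # dummy sample data
        fake_collection.append(pyd)

    print([d['foff'] for d in fake_collection])  # [0, 2400]
```

The first trace starts at offset 0 and writes 300 doubles (2400 bytes), so the second trace's foff is 2400; the real reader seeks to foff in dir/dfile and reads npts*3 doubles.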

That all worked great. I then wrote a simple reader and got it to work fine on the 2004 USArray data I had converted. Then I go to the 2005 USArray data, and suddenly the reader exits with this unexpected message:

    Traceback (most recent call last):
      File "simplereadertest.py", line 27, in
        print('converted Metadata type of foff=',md.get_long('foff'))
    RuntimeError: Error in Metadata get method.  Type mismatch in attem to get data with key=foff
    boost::any bad_any_cast wrote this message:
    boost::bad_any_cast: failed conversion using boost::any_cast
    Trying to convert to data of type=int
    Actual entry has type=double

Where did the double come from, I asked? In debugging I had always seen foff returned as a bson int.

WELL the gotcha here is the 32-bit integer barrier. This error was thrown when foff exceeded 2^31 - 1, the maximum for a signed 32-bit integer. With google I quickly learned that MongoDB defaults integer data to 32-bit ints. A strange default in the 64-bit world, but we will have to work around this one AND be very conscious of it.
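To make the boundary concrete, here is a pure-Python illustration (no MongoDB needed) using the two foff values from the debug output below: PyMongo stores a Python int that fits in a signed 32-bit range as a BSON int32, and promotes anything larger to a BSON int64 (`bson.int64.Int64`).

```python
# Signed 32-bit integers top out at 2**31 - 1.  Anything larger cannot
# be stored as a BSON int32, so PyMongo promotes it to a BSON int64.
INT32_MAX = 2**31 - 1          # 2147483647

foff_ok = 2145774168           # offset from a read that worked
foff_bad = 2147790240          # offset from the read that failed

print(foff_ok <= INT32_MAX)    # True  -> stored as int32
print(foff_bad <= INT32_MAX)   # False -> promoted to int64
```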

Right now I'm struggling with how to actually handle this. Debug lines in the reader show me output like this for data read before the 32-bit barrier:

    Read data for sta= MPP evid= 192233
    mongdb foff value= 2145774168  python type= <class 'int'>
    type of data with key= iphase is <class 'str'>
    type of data with key= phase is <class 'str'>
    type of data with key= sta is <class 'str'>
    type of data with key= evid is <class 'int'>
    type of data with key= foff is <class 'int'>
    type of data with key= npts is <class 'int'>
    type of data with key= orid is <class 'int'>

The foff lines are the ones of interest. The read that created the error has this comparable verbose output:

    Read data for sta= P05C evid= 192233
    mongdb foff value= 2147790240  python type= <class 'bson.int64.Int64'>
    type of data with key= iphase is <class 'str'>
    type of data with key= phase is <class 'str'>
    type of data with key= sta is <class 'str'>
    type of data with key= evid is <class 'int'>
    type of data with key= foff is <class 'bson.int64.Int64'>
    type of data with key= npts is <class 'int'>

I guess the reader is going to have to handle this and somehow interact with the C++ code to distinguish int32 and int64 things. This is VERY ugly for the C++ side, as most 64-bit platforms treat a long as 64 bits these days. It has a lot of downstream complications for MongoDB interactions with the C++ code. I need to think this over, but if you have any thoughts on how to handle the problem I'd love to hear them. There is a lot of discussion on this topic on the web.

I think what I need to do is figure out a way to guarantee all ints passed to C++ code are promoted to int64, but it is not at all clear how to do that cleanly and transparently.

pavlis commented 4 years ago

I ran across this site that suggests there may be a setting we can put in the configuration file for MongoDB running on the docker container. It is old and may no longer be valid, but it suggests that may be a solution. You are the expert in MongoDB, so maybe you can pursue this.

wangyinz commented 4 years ago

I am not exactly sure about this. To my understanding, the setting you referred to in the link seems to be exclusive to Mongo's PHP API, so it does not apply here.

If you still recall our earlier discussion, Python integers carry no fixed width of their own; a value can effectively start as a 32-bit int and become a 64-bit int when needed. From the debug info, I think that's exactly what happened. The foff is just an integer. When the number is small, it fits in 32 bits and gets pushed to MongoDB as a 32-bit <class 'int'>. The larger ones need 64 bits and get pushed to MongoDB as <class 'bson.int64.Int64'>, which is Mongo's wrapper type for a 64-bit int. So, it seems this is the correct behavior.

I am not quite sure where it went wrong when reading the metadata from MongoDB. I guess the first thing to test is to use the PyMongo API to retrieve the foff and do a type(foff) to see whether the type is actually correct in Python. Then, we might need to figure out what the added complexity could be with the C++ code involved.

pavlis commented 4 years ago

You describe exactly the test I did, and you state exactly what I observed and also concluded. That is, MongoDB seems to store integer data as a 32-bit int until it overflows and then promotes it to a 64-bit int. It does that automatically, it would seem.

I think my solution will need to involve python functions to convert back and forth from dict to Metadata. The functions will need to have this coded into them somehow. I already have something similar in Metadata where get_int and get_long will both try 32 and 64 bit types before failing. I need to work on it some more tomorrow, but think I can work around this pretty easily and make it pretty transparent. I did think it was important to document this issue here as it may come back to bite us again - possibly more than once.

wangyinz commented 4 years ago

Aha, now I understand the debug info a lot better. This is actually the behavior of the PyMongo API to resolve the issue that Python does not distinguish the bit length of an integer type while MongoDB does. That's why you found that similar discussion over the PHP API, since there is a comparable type incompatibility there.

Anyway, the issue we are having is actually a mix of two: incompatibility of integer types between 1) Python and MongoDB, and between 2) Python and C++. For issue one, I found this very helpful table here in PyMongo's documentation. I found another discussion in https://github.com/mongodb/genny/pull/246, where they just use Int64 for everything.

For issue two, just as you said, I think it should be easy to handle. As long as we have variables in a basic Python datatype, it will be straightforward to convert them to C++.

pavlis commented 4 years ago

Problem solved, but required two things that I think will hide this from most users.

  1. Part of the issue was how pybind11 handles overloading. As you likely realize, python does not have the concept of overloading, and pybind11 tricks the interpreter somehow (at least I think that is what goes on). The mixed-precision data was fooling the overload resolution, I think, and I found it necessary to add explicit put functions to the Metadata C++ API: put_long, put_double, put_string, put_bool, and (probably redundant) put_int.
  2. I had to use the isinstance builtin function to test each value in the name-value pairs returned by MongoDB. That works because a bson.int64.Int64 value, say y, returns True when tested with isinstance(y,int). Hence, this short python function will be a core interface method between MongoDB and the C++ library:
    def dict2md(d,mdef):
      md=Metadata()
      for x in d.keys():
        if(x!='_id'):
          y=d[x]
          # bool must be tested before int:  in python bool is a
          # subclass of int, so the int branch would capture True/False
          if(isinstance(y,bool)):
            md.put(x,y)   # no ambiguity here
          elif(isinstance(y,int)):
            yi=int(y)   # normalizes bson.int64.Int64 to a plain int
            md.put_long(x,yi)
          elif(isinstance(y,float)):
            yd=float(y)
            md.put(x,yd)
          elif(isinstance(y,str)):
            md.put(x,y)   #  no ambiguity in str; assume UTF-8
          else:
            print('dict2md:  data with key=',x,' is unsupported type=',type(y))
            print('attribute will not be loaded')
      return md

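One subtlety with isinstance chains like the one above, worth recording here because it is a property of Python itself: bool is a subclass of int, so a bool test must come before the int test or it will never fire. Easy to verify:

```python
# bool subclasses int in Python, so an isinstance(..., int) test placed
# first would also match True and False and route them to put_long.
print(isinstance(True, int))    # True  -- this is the trap
print(isinstance(True, bool))   # True
print(isinstance(1, bool))      # False -- plain ints are not bools
```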
wangyinz commented 4 years ago

I am a little confused by the short script above. I thought the issue was only in handling integers, so why is the conversion of float necessary here? Other than that, I think this is a pretty clean solution.

btw, in case you were not already aware, the github editor is actually a Markdown editor, which supports a bunch of formatting syntax. I already edited your previous comment to put the code in a code block. You can learn about it here.

pavlis commented 4 years ago

The reason that confused you is that the above was only my initial prototype and I didn't explain where it was headed. The current version does a second type check with MetadataDefinitions (mdef), logs an error, and tries to enforce (when possible) the types defined in mdef. Most potential errors can be recovered (e.g. something defined as an int can be cleanly converted to a float), but some require charging ahead with a warning (e.g. a string with alpha characters cannot be converted to a number and is just posted as a string with a warning). The point is the final form should be bombproof and never throw an error.

I probably should put this in a different place, but in working through this I found something I misjudged earlier. That is, pybind11 actually does convert C++ booleans correctly, and python bool values are passed to C++ in a compatible form. The problem was a VERY subtle feature of python. I learned that the bool True and the string 'true' are not the same. If I pass a True to a wrapped C++ function it runs fine, but if I pass the string 'true' it will generate a type error exception.
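This is easy to see at the interpreter (plain Python, nothing MsPASS-specific): True is a bool, while 'true' is just a str, so a wrapped C++ function expecting a bool will reject the latter. Worse, naive string-to-bool conversion is a trap of its own:

```python
print(type(True))     # <class 'bool'> -- what a wrapped C++ bool expects
print(type('true'))   # <class 'str'>  -- raises a type error instead

# bool() on a string tests emptiness, not meaning:  any non-empty
# string is truthy, so even 'false' converts to True.
print(bool('true'))   # True
print(bool('false'))  # True as well!
```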

pavlis commented 4 years ago

Small update on this issue. I ran into a related problem that convinced me that the overall solution is to use int64 as much as possible and, when in doubt, force it. I ran into a very strange problem that drove me crazy all day. Briefly, the problem code was approximately the following:

  foff=read_file(d,mc)   # file reader returns file offset where the data were written (tell output)
  d.put_int('foff',foff) 

That put failed every time until I realized boost::any declared it of type long, which caused a type mismatch error. The problem went away after changing that to d.put_long('foff',foff).

I have disabled put_int in the python wrappers to keep this from biting us or users again.

wangyinz commented 4 years ago

That is a very tricky issue. If you remember our discussion on #17, we pretty much agreed that we would only use 64-bit types for everything interfacing with Python. I think it makes more sense, as 32-bit can almost be considered ancient nowadays.

pavlis commented 4 years ago

The latest changes I checked in this morning lock in the 64-bit integer as much as I know how at this point. I've disabled the get_int and put_int wrappers so only get_long and put_long will be exposed to python. It turns out the C++ getters were already robust to 32 versus 64 bits for both int and float. I went a step further in the python code, which attempts to fix anything feasible. That means the only real gotcha is trying to treat a utf-8 string with nonnumeric characters as a number.

I'm working on a few more features for the mongodb python interface. I think I will rename the module io, so one would type import mspasspy.mongodb.io. I thought about readwrite but decided that was too verbose and misleading, since the library also needs to support updates. Updates are going to take some testing to be sure they are done right.

wangyinz commented 4 years ago

I was looking at the 2004export.* and 2005export.* data sets that you shared earlier and found one potential issue, which is related to what has been discussed here. If you look at the foff field of the last several entries in 2005export.yaml, you will see that they are negative. Tracing it back a little, this is obviously caused by a sign flip in a signed 32-bit integer type. I think it is due to defining foff as int in the export_to_mspass program (which is not in this repo). Depending on how extensively this wrong type was used, it might have affected some of the antelope databases we built before.
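The sign flip is easy to reproduce in pure Python (the offset below is the value from the failing read earlier in this thread): reinterpreting the low 32 bits of a 64-bit offset as a signed 32-bit int yields a negative number, which is what a 32-bit int field silently does to a large foff.

```python
import struct

foff64 = 2147790240  # a valid 64-bit file offset just past 2**31 - 1

# Pack as a signed 64-bit little-endian integer, then reinterpret the
# low 4 bytes as a signed 32-bit int -- effectively what storing the
# value into a 32-bit 'int' field does.
foff32 = struct.unpack('<i', struct.pack('<q', foff64)[:4])[0]
print(foff32)  # -2147177056: the sign has flipped
```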

pavlis commented 4 years ago

Weird bug in export_to_mspass. I think it is a subtle issue with the default definitions for the manipulators in the C++ stream operators. The code clearly defines a long for the writes that have this error, and I'm pretty certain the bits in the original long variable are correct. A quick google search suggests this is a known oddity. I'll work to fix it when I get a chance.

pavlis commented 4 years ago

This web page almost certainly describes the problem. The compiler I used to generate the export_to_mspass program must not be using the convention I thought had become universal for 64-bit systems - what that page calls LP64. I suspect it was compiled with LLP64.

I think the solution is to replace "long" in that code with int64_t.
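A quick, illustrative way to check which data model a platform uses (stdlib only): Python's struct module reports the width the platform C compiler uses for long. On LP64 systems (Linux, macOS) it is 8 bytes; on LLP64 (64-bit Windows) it is 4, which is exactly the case where a long holding a large foff would wrap.

```python
import struct

# struct's native 'l' format uses the platform C compiler's long width.
long_bits = struct.calcsize('l') * 8
print(long_bits, "bit long")   # 64 on LP64 platforms, 32 on LLP64

# int64_t (struct format 'q') is 8 bytes everywhere, which is why
# replacing 'long' with int64_t removes the ambiguity.
print(struct.calcsize('q') * 8, "bit int64_t")   # always 64
```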

This brings up a related issue. Mongo distinguishes unsigned and signed ints. Not sure how extensive a retrofit that would be, but I need to consider it.

pavlis commented 4 years ago

Dug into this briefly this morning, and am a bit confused why this is happening at all. This is the section of code in export_to_mspass.cc that writes any integer Metadata to the yaml file:

      double dval;
      long ival;
      string sval;
      bool bval;
      Metadata_typedef mtdef=(*mptr);
      switch(mtdef.mdt)
      {
        case MDint:
          ival=d.get<long>(mtdef.tag);
          ofs<<"  "<<mtdef.tag<<": "<<ival<<endl;
          break;
        case MDreal:
          dval=d.get<double>(mtdef.tag);
          ofs<<"  "<<mtdef.tag<<": "<<dval<<endl;
          break;
        case MDboolean:
          bval=d.get<bool>(mtdef.tag);
          ofs<<"  "<<mtdef.tag<<": "<<bval<<endl;
          break;
        case MDstring:
        default:
...

Note it explicitly treats all integers as long. In the LP64 model that implies 64 bits, but if the compiler was set to use LLP64 it would be 32.

I then wrote this simple little C++ test program to see how iostream treated long ints.

#include <iostream>
using namespace std;
int main(int argc, char **argv)
{
    long int a;
    long long int b;
    int i;
    cout << "a and b both set as small integers"<<endl;
    a=42;
    b=42;
    cout << a << " "<<b<<endl;
    a=2000000000000;
    b=a;
    cout << "2 trillion loaded and printed as long int"<<endl;
    cout <<a <<endl;
    cout << "2 trillion loaded and printed as long long int"<<endl;
    cout << b <<endl;
    cout << "2 trillion loaded and printed as int (always overflows)"<<endl;
    // BUG:  i is never assigned, so this line reads an uninitialized
    // variable (undefined behavior) - the value printed is garbage, not
    // an overflow of a.  i=(int)a; would show the real 32-bit truncation.
    cout << i <<endl;
}

Ran this on both a mac and carbonate and got the same result in every attempt:

a and b both set as small integers
42 42
2 trillion loaded and printed as long int
2000000000000
2 trillion loaded and printed as long long int
2000000000000
2 trillion loaded and printed as int (always overflows)
32766

Well, on carbonate the overflow output was 0 instead of 15 bits of all 1s. I used the 9.1 compiler on carbonate and also the default 6, and still got this result. The machine-to-machine difference is likely because i is never assigned in the test program, so that last line prints an uninitialized value rather than a true overflow. What is relevant is this does not seem to be linked to using a long in iostream.

There are three remaining explanations I can think of:

  1. The antelope makefile used to build export_to_mspass has a compiler flag set that is creating this oddity. Testable with a bit of messing with the test program above.
  2. There is something about the templated get interaction with the older seispp library I haven't uncovered. That will be much harder to find.
  3. Maybe there is something about the way I load foff as a signed integer that causes a problem. The C++ library specifies that tellp, which is used to fetch foff, returns the stream's pos_type (an fpos specialization), but the standard documentation says clearly this should convert cleanly to an int or long.

Any other ideas?

wangyinz commented 4 years ago

Well, I don't know... Is there source code for export_to_mspass.cc that I have access to? It's hard to think of anything without looking at it.

pavlis commented 4 years ago

export_to_mspass is in antelope contrib, BUT I haven't put it into the master repository yet.

Don't think it is your problem to fix this - leave it to me. I want to rebuild antelope contrib on carbonate and see if the problem persists. You don't see this as a must-fix-immediately problem, do you?

wangyinz commented 4 years ago

I agree. We won't be using it in the near term anyway, so just take your time.