salortiz / LMDB_File

Perl wrapper around the OpenLDAP's LMDB

How to create read-only transaction pool? #17

Closed akotlar closed 8 years ago

akotlar commented 8 years ago

When trying to create more than one transaction, I receive either undef (if using $env->BeginTxn()), or "Transaction active, should be sub transaction" when using LMDB::Txn->new.


my @txns;
foreach (0 .. $maxReaders - 1) {
  my $txn = LMDB::Txn->new($env, MDB_RDONLY);

  $txn->reset();

  push @txns, $txn;
}

hoytech commented 8 years ago

@salortiz can explain this in more detail, but my understanding is that you are only allowed to have one outstanding transaction at a time per env per thread (not counting sub-transactions).

Salvador mentions this in an RT ticket:

Yes, right now and per thread, the high level Perl API support only one transaction at a time.

Can you provide more details on your use case please? Why do you want to have multiple transactions in the same process at the same time? How many environments have you created? Are you using threads? Was the env created with MDB_NOTLS? I see you are using reset, are you using renew anywhere?

As usual, the most helpful thing is a complete working example that demonstrates the issue.

akotlar commented 8 years ago

This makes sense. In general, I am having trouble working with read-only transactions in a forking environment (using MCE::Loop). The reason I wanted this pool was to allow the environment + threads to be created in the parent process, then consumed in child processes. However, I don't think this is the correct approach. I've since tried creating a single environment per instance of my db package and, for read-only use, opening a single transaction that is then reset and renewed. However, this leads to segfaults (with MDB_NOTLS enabled; they may go away when it is disabled). Similarly, I am confused about how to properly use cursors in such environments (for instance, do I need to call DESTROY, or let the cursor go out of scope? Can I reset a read-only transaction that was tied to a cursor before destroying the cursor? What about renewing the cursor?).

My current workaround is to open and commit a new transaction for each bulk request, even when environment opened in read only mode, but this strikes me as improper use.

I think that it would be very useful to provide more extensive examples for read only transactions in documentation.

I'm happy to contribute such documentation, but need a little more information on the subject before I can. I have read several LMDB discussion threads, but found few code examples, even in C, Java, Julia, Go, or other languages.

hoytech commented 8 years ago

allow the environment + threads to be created in the parent process, then consumed in child processes. However, I don't think this is the correct approach

Without knowing all the specifics, I think you're right that this is not a good approach. I don't believe fork()ing an LMDB environment is a good idea, and fork()ing after creating threads is a big no-no (unless you exec or _exit immediately after). In an MP scenario I would create a new environment in every child process -- I don't think this is particularly expensive.
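A minimal sketch of that suggestion, assuming plain fork() rather than MCE::Loop and a database that already exists at $path. The parent never opens the environment; each child creates its own LMDB::Env after the fork. The helper names read_key_in_this_process and run_workers are hypothetical, and the sketch assumes that the options hash accepts flags => MDB_RDONLY and that OpenDB() with no arguments opens the default (unnamed) database.

```perl
use strict;
use warnings;
use LMDB_File qw(:flags);

# Hypothetical helper: open the environment at $path read-only in the
# current process and fetch one key from the default database.
sub read_key_in_this_process {
    my ($path, $key) = @_;
    my $env = LMDB::Env->new($path, { flags => MDB_RDONLY });
    my $txn = $env->BeginTxn(MDB_RDONLY);
    my $db  = $txn->OpenDB();
    my $value = $db->get($key);
    $txn->commit;    # ends the read-only transaction
    return $value;
}

# Hypothetical helper: fork one child per key. Returns the number of
# children that exited with a non-zero status.
sub run_workers {
    my ($path, @keys) = @_;
    my @pids;
    for my $key (@keys) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: the environment is created here, after fork().
            my $value = read_key_in_this_process($path, $key);
            exit(defined $value ? 0 : 1);
        }
        push @pids, $pid;
    }
    my $failures = 0;
    for my $pid (@pids) {
        waitpid($pid, 0);
        $failures++ if $? != 0;
    }
    return $failures;
}
```

Because no MDB_env exists in the parent at fork time, nothing LMDB-related is inherited by the children.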

open and commit a new transaction for each bulk request, even when environment opened in read only mode, but this strikes me as improper use.

Can you elaborate on why you think this is an improper use? To me it seems perfectly fine since I tend to think of one transaction encompassing one unit of work (read-only or otherwise).
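As a rough sketch of "one transaction per unit of work" for reads: each bulk request opens its own read-only transaction and finishes it before returning, so only one transaction is ever outstanding. fetch_batch is a hypothetical helper name, and the sketch assumes that the two-argument form of get yields undef for a missing key.

```perl
use strict;
use warnings;
use LMDB_File qw(:flags);

# Hypothetical helper: fetch a batch of keys inside a single read-only
# transaction, then end the transaction before returning.
sub fetch_batch {
    my ($env, @keys) = @_;
    my $txn = $env->BeginTxn(MDB_RDONLY);
    my $db  = $txn->OpenDB();    # default (unnamed) database
    my %found;
    for my $key (@keys) {
        my $value = $db->get($key);
        # Assumption: a missing key gives undef rather than an exception.
        $found{$key} = $value if defined $value;
    }
    $txn->commit;                # releases the reader slot
    return \%found;
}
```

Calling fetch_batch repeatedly on the same $env runs the batches one after another, each with a fresh snapshot of the data.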

Regarding reset/renew, I'm sorry I don't have experience with these. Personally I would avoid this unless you have a reason to believe reader table locking is a significant overhead in your app (this overhead is probably less than the cost of a perl method call).

it would be very useful to provide more extensive examples for read only transactions in documentation

Yes, I agree 100% :)

akotlar commented 8 years ago

Thank you, help is appreciated.

As for why improper use, I just mean if the environment is opened in read only state, it seems a waste to _prune the transaction

Also, how do you handle errors? Do you read LMDB_File::last_err after $txn->commit, after each $txn->get/put operation, or do you store the return value from each $txn operation? I am attempting to avoid the extra assignment of the return value (i.e. my $return = $txn->get( );), since get operations can happen tens of millions of times in a single program execution, and I am already CPU-limited (and the values are stored in LMDB_File::last_err anyway).

hoytech commented 8 years ago

I'm almost always using the default exception-throwing interface. It's a little bit annoying in some cases which aren't really errors per se (i.e. when keys don't exist, traversing a cursor, etc.), but in my opinion generally preferable to having to check every error code returned.
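One way to use the exception-throwing mode without checking each call is to wrap a whole unit of work in a single eval. The sketch below assumes the default behaviour (errors are thrown); with_txn is a hypothetical helper name, not part of LMDB_File.

```perl
use strict;
use warnings;
use LMDB_File qw(:flags);

# Hypothetical helper: run $work->($db) in its own write transaction.
# Returns (1, undef) on success or (0, $error_text) if anything died.
sub with_txn {
    my ($env, $work) = @_;
    my $txn = $env->BeginTxn();
    my $ok = eval {
        my $db = $txn->OpenDB();    # default (unnamed) database
        $work->($db);
        $txn->commit;
        1;
    };
    return (1, undef) if $ok;
    my $err = $@ || 'unknown error';
    # Best effort: the transaction may already be finished if commit failed.
    eval { $txn->abort };
    return (0, $err);
}
```

The caller then checks one status per unit of work, for example my ($ok, $err) = with_txn($env, sub { $_[0]->put($key, $value) });, instead of one per get or put.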

I've never much worried about the performance of the LMDB_File interface because I've never felt it to be a bottleneck.

Again, I don't know the specifics of your app, but I would guess that there is lower-hanging fruit than minimising error-code assignments. In your other ticket you mention storing values encoded with JSON. For many use-cases there are better serialization formats. To bypass the decoding step altogether there are things like Cap'n Proto (although no Perl support yet). Some day I'd love to finish up my work on qstruct which would be ideal when combined with LMDB.

More practically, you might find Sereal interesting. It allows you to compress your data (gzip or snappy), which, surprisingly, might improve CPU-bound workloads. For very large datasets such as yours, page-faulting uncompressed data in can be more expensive than decompression.
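A small sketch of a compressed Sereal round trip, assuming the Sereal::Encoder and Sereal::Decoder modules are installed. The record contents and the 64-byte compress_threshold are illustrative choices for this example, not values from the thread.

```perl
use strict;
use warnings;
use Sereal::Encoder;
use Sereal::Decoder;

# Encoder with zlib compression; payloads smaller than compress_threshold
# bytes are left uncompressed.
my $encoder = Sereal::Encoder->new({
    compress           => Sereal::Encoder::SRL_ZLIB(),
    compress_threshold => 64,
});
my $decoder = Sereal::Decoder->new;

# Illustrative record; the field names are made up for this sketch.
my $record = { id => 42, tags => [ ('missense') x 200 ] };

my $blob    = $encoder->encode($record);   # byte string usable as an LMDB value
my $decoded = $decoder->decode($blob);     # back to a Perl data structure

printf "encoded %d bytes, decoded %d tags\n",
    length($blob), scalar @{ $decoded->{tags} };
```

Whether compression helps depends on the data; measuring both settings on a sample of real values is the safer guide.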

akotlar commented 8 years ago

@hoytech, thanks. So if I understand correctly, you're using LMDB_File::last_err (with $die_on_error disabled) after each commit, or maybe after each bulk operation. I'm doing the same.

Regarding JSON: I'm using Data::MessagePack; it seems to be the most compact and approximately the fastest. Using JSON, our database was ~400-500 GB, which was a bit too much for our budget (multiple of these, terabytes on EC2 instance storage).

qstruct definitely sounds interesting. One thing I don't like about Data::MessagePack is its handling of floats: all are treated as double precision, which is a huge waste of space. I have had to encode single-precision floats as strings, which actually ended up being smaller.
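The size difference can be illustrated with core Perl's pack. This only measures the raw payload for one illustrative value; a serializer such as MessagePack adds its own type and length bytes on top, and this is not necessarily how the values were encoded in the application described here.

```perl
use strict;
use warnings;

my $x = 0.25;    # illustrative value, exactly representable in binary

my $as_double = pack('d<', $x);    # IEEE 754 double, little-endian
my $as_single = pack('f<', $x);    # IEEE 754 single, little-endian
my $as_string = "$x";              # decimal text

printf "double=%d single=%d string=%d\n",
    length($as_double), length($as_single), length($as_string);
# prints "double=8 single=4 string=4"

# Round-trip the single-precision form.
my ($back) = unpack('f<', $as_single);
```

Short decimal strings can therefore undercut an 8-byte double, while a 4-byte single matches or beats them for most values.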

hoytech commented 8 years ago

Actually I meant I leave $die_on_error true and rely on exceptions being thrown. Same with regular DBI: I nearly always enable RaiseError.

Interesting regarding the floating point precision. Again, you might want to check out Sereal. I think it's superior to msgpack in most ways. Sereal::Encoder claims to use different FP precisions:

Floating point values can appear to be the same but serialize to different byte strings due to insignificant 'noise' in the floating point representation. Sereal supports different floating point precisions and will generally choose the most compact that can represent your floating point number correctly.

Here are some benchmarks comparing Sereal to msgpack and others:

https://github.com/Sereal/Sereal/wiki/Sereal-Comparison-Graphs

Sereal::Path is also a really neat module: It lets you extract bits without decoding the whole message.

akotlar commented 8 years ago

Thanks, Sereal looks like a potentially good alternative.

From LMDB documentation:

"LMDB uses POSIX locks on files, and these locks have issues if one process opens a file multiple times. Because of this, do not mdb_env_open() a file multiple times from a single process. Instead, share the LMDB environment that has opened the file across all threads. Otherwise, if a single process opens the same environment multiple times, closing it once will remove all the locks held on it, and the other instances will be vulnerable to corruption from other processes"

If I fork after I create an environment in the parent, I would have the same environment id in all children; from LMDB's perspective, are these children treated like threads with regard to file locks?

edit: I see under LMDB Caveats

  Use an MDB_env* in the process which opened it, without fork()ing.

hoytech commented 8 years ago

As you noted in your edit, looks like LMDB specifically disallows that. :)

Best to just create the envs once you're all done fork()ing.

akotlar commented 8 years ago

Thank you!