logv / sybil

columnar storage + NoSQL OLAP engine | https://logv.org

Problem with using FLAGS.TABLE to access table #119

Open wang502 opened 4 years ago

wang502 commented 4 years ago

In src/lib/table_ingest.go, inside the function func (cb *SaveBlockChunkCB) CB(digestname string, records RecordList), FLAGS.TABLE is used to access the target table.

But using FLAGS.TABLE causes a problem when I try to use sybil as a library instead of from the command line. If we have several goroutines, and each goroutine ingests into a different table, we cannot guarantee that each one accesses the table it wants. In my local test, when I tried to ingest data into table A, some block files were created under /db/ instead of /db/A/, and the final number of records stored in table A was less than the number of records I ingested. After I changed the implementation as described below, the issue went away.

So instead of using FLAGS.TABLE, maybe we can store the corresponding table name inside the struct SaveBlockChunkCB, and inside CB() use t := GetTable(cb.table) to access the table.
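A minimal sketch of that change, using stand-ins for sybil's internal types (RecordList, Table, and GetTable are simplified here for illustration, not sybil's actual definitions):

```go
package main

import "fmt"

// Hypothetical stand-ins for sybil's internal types; the names follow
// the issue text but the definitions are illustrative only.
type RecordList []string

type Table struct{ Name string }

// note: the real implementation would need locking around this map
// for concurrent use; omitted here to keep the sketch short.
var tables = map[string]*Table{}

func GetTable(name string) *Table {
	if t, ok := tables[name]; ok {
		return t
	}
	t := &Table{Name: name}
	tables[name] = t
	return t
}

// SaveBlockChunkCB now carries its own target table name, so
// concurrent ingests no longer race on the global FLAGS.TABLE.
type SaveBlockChunkCB struct {
	table string
}

func (cb *SaveBlockChunkCB) CB(digestname string, records RecordList) {
	t := GetTable(cb.table) // instead of consulting FLAGS.TABLE
	fmt.Printf("saving %d records into table %s\n", len(records), t.Name)
}

func main() {
	cb := &SaveBlockChunkCB{table: "A"}
	cb.CB("digest0001", RecordList{"r1", "r2"})
}
```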

Let me know if I'm missing anything. If this is a real problem and the solution sounds viable, I will put up a pull request.

okayzed commented 4 years ago

is there a particular reason you want this change / architecture?

sybil is written as a single-invocation binary for a given operation, not a server process, which is why it uses these flags instead of passing configuration blobs around. you'll find that this change might work for ingestion, but it won't work for digestion or querying.

there are some pros and cons here.

memory management is easier, for one. similarly, a crash or error will only affect one operation. a downside is that all data is on disk.

if you want a server-like process, an embedded process, or a library instead of a unix-style model, the easiest thing is to wrap sybil calls inside a process

i have a long-standing task to try to make sybil into a server process with a data-in-ram architecture. i think many code paths can be shared, but it wouldn't make sense to turn sybil (as it stands) into a server; instead, there would be two binaries with shared functions.

there was previous work 2-3 years ago from someone trying to migrate the architecture, but i wasn't satisfied with the resulting code. i think i now have a better idea of how it should look. please feel free to poke me on irc or email if you want more specifics or to do a voice chat

okayzed commented 4 years ago

i'd also gladly accept a library for sybil that wraps sybil cli calls so it can be used in the way you describe.

okayzed commented 4 years ago

the main problem with your approach is if auto digest is initiated; otherwise i think it is fine (but you can use a flag to disable auto digest)

wang502 commented 4 years ago

My architecture right now is as follows:

  1. event data are stored in kafka
  2. my server keeps pulling messages from kafka and distributes them to different goroutines, and each goroutine handles ingestion for one sybil table.

For this real-time ingestion use case, doesn't it make more sense to expose the ingest/digest API as a library? Making it a library call instead of a cli call can be more lightweight in terms of system resources. And data has to be in a file before it can be ingested via the cli.

However I totally get your point of

wrap sybil calls inside a process

and the benefit of

have two binaries with shared functions

Do you have pointers as to how to wrap sybil calls inside a process? And at the same time avoid writing to disk before calling sybil?

okayzed commented 4 years ago

you don't need to write to disk before sending to sybil; you can stream over stdin to the sybil proc (as per the standard unix process model)
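A sketch of streaming a batch over stdin with golang's "os/exec" package. The sybil subcommand and flags shown in the comment are placeholders to check against the actual CLI; the example itself pipes to cat so it runs anywhere:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os/exec"
)

// ingestBatch streams a batch of records as newline-delimited JSON
// to a subprocess over stdin. The binary and arguments are passed in
// so nothing sybil-specific is hard-coded here.
func ingestBatch(bin string, args []string, records []map[string]interface{}) error {
	cmd := exec.Command(bin, args...)
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	enc := json.NewEncoder(stdin) // Encode writes one JSON value per line
	for _, r := range records {
		if err := enc.Encode(r); err != nil {
			stdin.Close()
			return err
		}
	}
	stdin.Close() // EOF tells the subprocess the batch is done
	return cmd.Wait()
}

func main() {
	records := []map[string]interface{}{
		{"event": "click", "weight": 1},
		{"event": "view", "weight": 2},
	}
	// against sybil this might look like (flags are an assumption,
	// check the sybil CLI help for the real ones):
	//   ingestBatch("sybil", []string{"ingest", "-table", "A"}, records)
	if err := ingestBatch("cat", nil, records); err != nil {
		fmt.Println("ingest failed:", err)
	}
}
```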

in general, batching data into sybil is better than writing one record at a time, so it's fine to build a buffer of 1k, 2k, 5k or 10k records in memory (or on disk) before flushing to sybil - it just depends on how realtime you want your ingestion. kafka (or scribe) definitely makes sense and is how the perfpipe_ stuff works.

there's another person who built a simple sybil wrapper for ingesting jaeger events: https://github.com/gouthamve/redbull/blob/master/pkg/redbull/sybil.go. they had one more caveat: they built a "virtual table" - one large table partitioned by hour - because sybil was not optimized at the time for very large tables (500+mm records per table) and we were seeing digestion slow down as more records were added. this is now fixed, and digestion time no longer grows much as the table gets larger.

If that example file isn't helpful, I can put together a smaller demo that is neater / easier to use, since their use case was a little over-complicated. Really, you mostly just need golang's "os/exec" package for calling out to sybil.

in terms of API, i would try designing the API and library you want, then filling it out with the sybil calls. If you want me to help design what the API might look like, I'd be happy to spend time on it with you.

An example API might be like:

table := SybilTable{my_table} // initializer
table.AppendRecords(...) // appends records in memory
...
...
table.Flush()  // actually makes the sybil proc call

and all the actual calls to the sybil proc would be hidden inside the SybilTable class.
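One way that wrapper could be filled out, with the actual proc call injected as a function so the buffering logic stands on its own (all names here are hypothetical, not an existing sybil API):

```go
package main

import "fmt"

// Record is shorthand for one ingested row.
type Record map[string]interface{}

// SybilTable buffers records in memory and flushes them in batches,
// matching the API sketched above. flushFn is where the actual call
// to the sybil proc would live (e.g. an os/exec invocation streaming
// JSON over stdin); it is injected here so the sketch is self-contained.
type SybilTable struct {
	Name      string
	BatchSize int
	buf       []Record
	flushFn   func(table string, batch []Record) error
}

// AppendRecords buffers records, auto-flushing once BatchSize is
// reached so the proc-spawn cost is paid per batch, not per record.
func (t *SybilTable) AppendRecords(recs ...Record) error {
	t.buf = append(t.buf, recs...)
	if t.BatchSize > 0 && len(t.buf) >= t.BatchSize {
		return t.Flush()
	}
	return nil
}

// Flush hands the buffered batch to flushFn and clears the buffer.
func (t *SybilTable) Flush() error {
	if len(t.buf) == 0 {
		return nil
	}
	batch := t.buf
	t.buf = nil
	return t.flushFn(t.Name, batch)
}

func main() {
	table := &SybilTable{
		Name:      "my_table",
		BatchSize: 2,
		flushFn: func(name string, batch []Record) error {
			fmt.Printf("flushing %d records to %s\n", len(batch), name)
			return nil
		},
	}
	table.AppendRecords(Record{"a": 1})
	table.AppendRecords(Record{"a": 2}) // hits BatchSize, auto-flushes
	table.Flush()                       // flush whatever remains
}
```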

I think building the general purpose API will be helpful to other people who have similar use case as you, so I'm glad to invest in it.

One last thing: I'm not sure what the overhead of spawning a new sybil process is. On the one hand, making a new proc can take 15+ ms, but if you are batching records, it won't be as big of a deal because it happens infrequently per table (once or less per second, potentially)

okayzed commented 4 years ago

After looking at the amount of effort it takes, I think writing a golang wrapper is kind of painful (but still doable). I would recommend writing a python or other-language wrapper (that reads off kafka and ships to sybil) because it will be much simpler. I can help down that path, too

I will continue investigating both directions: 1) a compiled golang wrapper around the sybil proc and 2) an interpreted wrapper.