nielsbasjes / logparser

Easy parsing of Apache HTTPD and NGINX access logs with Java, Hadoop, Hive, Flink, Beam, Storm, Drill, ...
Apache License 2.0

List of formats and data types #34

Closed cgivre closed 7 years ago

cgivre commented 8 years ago

Hi Niels, I'm working on adapting your parser for Apache Drill, and I was wondering whether there is a list somewhere of the fields the parser supports and their data types. Thanks,

nielsbasjes commented 8 years ago

Hi,

There is already a ticket for that on the Drill side: https://issues.apache.org/jira/browse/DRILL-3423 As far as I remember, something in the integration of the two was very hard to do in Drill. Please check that ticket and have a look at the discussion that took place there.

Niels Basjes

nielsbasjes commented 8 years ago

In addition: please realize that this parser is pluggable, so the set of fields it can extract is not a 'static thing'. Also, when extracting cookies and query string parameters, for example, a 'column' can be defined for each possible parameter name you can think of. As a consequence, there is no such list of fields.

cgivre commented 8 years ago

Hi Niels, Jim Scott was working on it, but got stuck and then didn’t have time to continue. He got it 95% of the way there. I’ve been working on it, and the issue the Drill committers had was that the field names (HTTP.USERAGENT.user-agent) were not “drill-friendly”. So I’ve made some changes on the Drill side whereby Drill hides the data type from the user. However, Drill still needs to map the name back to your parser’s field. I hope that makes sense. In other words, we wanted a user to be able to just type:

SELECT request_user-agent FROM

instead of: SELECT HTTP_USERAGENT:request_user-agent FROM

Likewise for the results that are returned. Removing the data type is obviously trivial, but adding it back for the parser requires a mapping of some sort. That’s why I was asking. Does that make sense? Thanks, Charles
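For illustration, the mapping described above could be sketched roughly as follows. This is only a sketch, not code from Drill or logparser: `FieldNameMapper`, `stripType` and `toParserField` are hypothetical names, and it assumes the data type is everything before the first colon, as in the `HTTP_USERAGENT:request_user-agent` example. The second field name in the usage note is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: maps type-less column names back to full parser field names. */
final class FieldNameMapper {
    // Short (user-facing) name -> full "TYPE:name" field as the parser expects it.
    private final Map<String, String> shortToFull = new LinkedHashMap<>();

    FieldNameMapper(List<String> fullFieldNames) {
        for (String full : fullFieldNames) {
            String shortName = stripType(full);
            String previous = shortToFull.putIfAbsent(shortName, full);
            // Two types may share the same name; a plain lookup cannot tell them apart.
            if (previous != null && !previous.equals(full)) {
                throw new IllegalArgumentException(
                    "Ambiguous name '" + shortName + "': " + previous + " vs " + full);
            }
        }
    }

    /** Drops the assumed "TYPE:" prefix; names without a colon are returned unchanged. */
    static String stripType(String full) {
        int colon = full.indexOf(':');
        return colon < 0 ? full : full.substring(colon + 1);
    }

    /** Restores the full field name for a column name the user typed. */
    String toParserField(String shortName) {
        String full = shortToFull.get(shortName);
        if (full == null) {
            throw new IllegalArgumentException("Unknown field: " + shortName);
        }
        return full;
    }
}
```

Built from a list such as `HTTP_USERAGENT:request_user-agent` and `IP:connection_client_host`, a call to `toParserField("request_user-agent")` gives back the first entry. Because the parser is pluggable, the list would have to be collected at runtime rather than hard-coded, and the constructor rejects two fields that differ only in their type.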

nielsbasjes commented 8 years ago

So what if my parser got an additional option to encode the field names (for example URL encoding, or something special), so that there is a readable variant without any 'funny' characters?

So my question to you is: What characters are considered to be 'normal' for column names in Drill? I assume at least [a-zA-Z0-9_]
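As a rough sketch of what such an encoding could look like, the class below rewrites a field name so that it only contains characters from [a-zA-Z0-9_] and can be decoded again. It is an assumption made for illustration, not an option that logparser provides: `SafeColumnName` is a hypothetical name, and the escape scheme (an underscore followed by four hex digits) is just one possible choice of "something special".

```java
/** Hypothetical reversible encoding of field names into [a-zA-Z0-9_]. */
final class SafeColumnName {

    /** Letters and digits pass through; everything else, including '_', becomes _xxxx. */
    static String encode(String name) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (isPlain(c)) {
                out.append(c);
            } else {
                // Four hex digits of the UTF-16 code unit, so decoding is unambiguous.
                out.append('_').append(String.format("%04x", (int) c));
            }
        }
        return out.toString();
    }

    /** Inverse of encode; expects input that encode produced. */
    static String decode(String encoded) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < encoded.length()) {
            char c = encoded.charAt(i);
            if (c == '_') {
                out.append((char) Integer.parseInt(encoded.substring(i + 1, i + 5), 16));
                i += 5;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    private static boolean isPlain(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }
}
```

With this scheme `request.user-agent` would become `request_002euser_002dagent`. That is safe but not very readable, which is the trade-off against a lossy replacement such as turning every special character into a single underscore.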

nielsbasjes commented 7 years ago

Is there something regarding this ticket I can do for you at this time?

cgivre commented 7 years ago

BTW, the parser works great with Drill!

nielsbasjes commented 7 years ago

Great to hear. Don't forget I released v3.0 with support for Nginx.