Duplicate records while querying with event Id

arunmahadevan commented 6 years ago

Turned on 100% sampling and then in the event sampling screen I searched with an event ID.

Multiple records are displayed, but in events.log there's only one record with this event ID.

2018-01-24 20:05:01.866!_DELIM_!<STREAMLINE_EVENT>!_DELIM_!KAFKA!_DELIM_!75ecde72-6845-4b0c-90ee-54a7e29d7983!_DELIM_![]!_DELIM_![]!_DELIM_!{user_id=Iu6AxdBYGR4A0wspR9BYHA, review_id=KPvLNJ21_4wbYNctrOwWdQ, stars=5, date=2014-02-13, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}!_DELIM_!{}!_DELIM_!{}
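One way to double-check that events.log really holds a single record per event ID is to split each line on the `!_DELIM_!` separator and count IDs. This is only a minimal sketch: the position of the event ID (fourth field) is inferred from the sample line above, the helper names are made up for illustration, and the sample payload is shortened.

```python
from collections import Counter

DELIM = "!_DELIM_!"
EVENT_ID_INDEX = 3  # assumption: the 4th field is the event ID, as in the sample line


def event_id_of(line):
    """Return the event ID of an events.log line, or None if the line has too few fields."""
    fields = line.rstrip("\n").split(DELIM)
    return fields[EVENT_ID_INDEX] if len(fields) > EVENT_ID_INDEX else None


def count_event_ids(lines):
    """Count how many log lines carry each event ID."""
    return Counter(eid for eid in map(event_id_of, lines) if eid is not None)


# Sample line from the log above, with the field map shortened for readability.
sample = (
    "2018-01-24 20:05:01.866!_DELIM_!<STREAMLINE_EVENT>!_DELIM_!KAFKA!_DELIM_!"
    "75ecde72-6845-4b0c-90ee-54a7e29d7983!_DELIM_![]!_DELIM_![]!_DELIM_!"
    "{user_id=Iu6AxdBYGR4A0wspR9BYHA, stars=5}!_DELIM_!{}!_DELIM_!{}"
)
counts = count_event_ids([sample])
print(counts["75ecde72-6845-4b0c-90ee-54a7e29d7983"])
```

Feeding the real file in (for example `count_event_ids(open(path))`) and looking for counts above 1 would show whether the duplication already exists on disk or is introduced later by logfeeder or Solr.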

See attached screenshot.

I see a similar result (around 25 duplicate records in the first page, with numFound at 228) when querying Solr directly. Only "id", "seq_num" and "_version_" differ across the records. This may be a Solr issue or a problem with the way we are querying, but we need to get it fixed.

http://ctr-e137-1514896590304-34122-01-000002.hwx.site:8886/solr/hadoop_logs/select?indent=on&wt=json&q=sdi_streamline_event_id:*&fq=type:storm_worker_event&fq=sdi_streamline_topology_id:1&fq=sdi_streamline_event_id:75ecde72-6845-4b0c-90ee-54a7e29d7983&sort=logtime+asc&start=0&rows=25

{
  "responseHeader":{
    "status":0,
    "QTime":38,
    "params":{
      "q":"sdi_streamline_event_id:*",
      "indent":"on",
      "start":"0",
      "fq":["type:storm_worker_event",
        "sdi_streamline_topology_id:1",
        "sdi_streamline_event_id:75ecde72-6845-4b0c-90ee-54a7e29d7983"],
      "sort":"logtime asc",
      "rows":"25",
      "wt":"json"}},
  "response":{"numFound":228,"start":0,"docs":[
      {
        "cluster":"cl1",
        "level":"INFO",
        "event_count":1,
        "ip":"172.27.31.8",
        "sdi_storm_worker_port":"6700",
        "type":"storm_worker_event",
        "sdi_streamline_root_ids":"[]",
        "seq_num":218898,
        "path":"/var/log/storm/workers-artifacts/streamline-1-test-3-1516755877/6700/events.log",
        "sdi_streamline_component_name":"KAFKA",
        "sdi_streamline_event_id":"75ecde72-6845-4b0c-90ee-54a7e29d7983",
        "sdi_streamline_parent_ids":"[]",
        "sdi_streamline_event_headers":"{}",
        "host":"ctr-e137-1514896590304-34122-01-000002.hwx.site",
        "sdi_streamline_topology_id":"1",
        "sdi_streamline_topology_name":"test",
        "id":"d83a67c7-b216-4f1e-9828-584605856133",
        "sdi_streamline_event_fields_and_values":"{user_id=Iu6AxdBYGR4A0wspR9BYHA, review_id=KPvLNJ21_4wbYNctrOwWdQ, stars=5, date=2014-02-13, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}",
        "sdi_storm_topology_id":"streamline-1-test-3-1516755877",
        "logtime":"2018-01-24T20:05:01.866Z",
        "event_md5":"1516824301866-731153258720675526",
        "logfile_line_number":2,
        "_ttl_":"+7DAYS",
        "_expire_at_":"2018-01-31T20:05:09.703Z",
        "_version_":1590505567372181505},
      {
        "cluster":"cl1",
        "level":"INFO",
        "event_count":1,
        "ip":"172.27.31.8",
        "sdi_storm_worker_port":"6700",
        "type":"storm_worker_event",
        "sdi_streamline_root_ids":"[]",
        "seq_num":218902,
        "path":"/var/log/storm/workers-artifacts/streamline-1-test-3-1516755877/6700/events.log",
        "sdi_streamline_component_name":"KAFKA",
        "sdi_streamline_event_id":"75ecde72-6845-4b0c-90ee-54a7e29d7983",
        "sdi_streamline_parent_ids":"[]",
        "sdi_streamline_event_headers":"{}",
        "host":"ctr-e137-1514896590304-34122-01-000002.hwx.site",
        "sdi_streamline_topology_id":"1",
        "sdi_streamline_topology_name":"test",
        "id":"02450712-f64a-4927-9988-3abc086fda72",
        "sdi_streamline_event_fields_and_values":"{user_id=Iu6AxdBYGR4A0wspR9BYHA, review_id=KPvLNJ21_4wbYNctrOwWdQ, stars=5, date=2014-02-13, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}",
        "sdi_storm_topology_id":"streamline-1-test-3-1516755877",
        "logtime":"2018-01-24T20:05:01.866Z",
        "event_md5":"1516824301866-731153258720675526",
        "logfile_line_number":2,
        "_ttl_":"+7DAYS",
        "_expire_at_":"2018-01-31T20:05:09.703Z",
        "_version_":1590505567374278656},
      {
        "cluster":"cl1",
        "level":"INFO",
        "event_count":1,
        "ip":"172.27.31.8",
        "sdi_storm_worker_port":"6700",
        "type":"storm_worker_event",
        "sdi_streamline_root_ids":"[]",
        "seq_num":218910,
        "path":"/var/log/storm/workers-artifacts/streamline-1-test-3-1516755877/6700/events.log",
        "sdi_streamline_component_name":"KAFKA",
        "sdi_streamline_event_id":"75ecde72-6845-4b0c-90ee-54a7e29d7983",
        "sdi_streamline_parent_ids":"[]",
        "sdi_streamline_event_headers":"{}",
        "host":"ctr-e137-1514896590304-34122-01-000002.hwx.site",
        "sdi_streamline_topology_id":"1",
        "sdi_streamline_topology_name":"test",
        "id":"d2573a72-2b0b-4a86-83dc-724ebc1cbaba",
        "sdi_streamline_event_fields_and_values":"{user_id=Iu6AxdBYGR4A0wspR9BYHA, review_id=KPvLNJ21_4wbYNctrOwWdQ, stars=5, date=2014-02-13, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}",
        "sdi_storm_topology_id":"streamline-1-test-3-1516755877",
        "logtime":"2018-01-24T20:05:01.866Z",
        "event_md5":"1516824301866-731153258720675526",
        "logfile_line_number":2,
        "_ttl_":"+7DAYS",
        "_expire_at_":"2018-01-31T20:05:09.703Z",
        "_version_":1590505567375327235},
...
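To see exactly which fields vary between the duplicate docs, the `docs` array of a response like the one above can be compared field by field. The sketch below is illustrative only: `differing_fields` is a hypothetical helper, and the two sample docs are trimmed copies of the first two docs in the response.

```python
def differing_fields(docs):
    """Return the sorted names of fields whose values are not identical across all docs."""
    keys = set().union(*(d.keys() for d in docs))
    # repr() makes unhashable values (lists, dicts) comparable inside a set.
    return sorted(k for k in keys if len({repr(d.get(k)) for d in docs}) > 1)


# Trimmed copies of the first two docs from the response above.
docs = [
    {
        "sdi_streamline_event_id": "75ecde72-6845-4b0c-90ee-54a7e29d7983",
        "event_md5": "1516824301866-731153258720675526",
        "logfile_line_number": 2,
        "seq_num": 218898,
        "id": "d83a67c7-b216-4f1e-9828-584605856133",
        "_version_": 1590505567372181505,
    },
    {
        "sdi_streamline_event_id": "75ecde72-6845-4b0c-90ee-54a7e29d7983",
        "event_md5": "1516824301866-731153258720675526",
        "logfile_line_number": 2,
        "seq_num": 218902,
        "id": "02450712-f64a-4927-9988-3abc086fda72",
        "_version_": 1590505567374278656,
    },
]
print(differing_fields(docs))  # ['_version_', 'id', 'seq_num']
```

The identical `event_md5` and `logfile_line_number` values suggest the same log line was indexed repeatedly, rather than the event being logged more than once.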

arunmahadevan commented 6 years ago

Similar behavior is observed when querying with a key/value search string.

http://ctr-e137-1514896590304-34122-01-000002.hwx.site:8886/solr/hadoop_logs/select?indent=on&wt=json&q=sdi_streamline_event_id:*+AND+(sdi_streamline_event_fields_and_values:stars%3D4,+OR+sdi_streamline_event_headers:stars%3D4,+OR+sdi_streamline_event_aux_fields_and_values:stars%3D4,)&fq=type:storm_worker_event&fq=sdi_streamline_topology_id:1&fq=logtime:[2018-01-24T19:36:04.194Z+TO+2018-01-24T20:06:04.194Z]&sort=logtime+asc&start=0&rows=25

{
  "responseHeader":{
    "status":0,
    "QTime":12,
    "params":{
      "q":"sdi_streamline_event_id:* AND (sdi_streamline_event_fields_and_values:stars=4, OR sdi_streamline_event_headers:stars=4, OR sdi_streamline_event_aux_fields_and_values:stars=4,)",
      "indent":"on",
      "start":"0",
      "fq":["type:storm_worker_event",
        "sdi_streamline_topology_id:1",
        "logtime:[2018-01-24T19:36:04.194Z TO 2018-01-24T20:06:04.194Z]"],
      "sort":"logtime asc",
      "rows":"25",
      "wt":"json"}},
  "response":{"numFound":61560,"start":0,"docs":[
      {
        "cluster":"cl1",
        "level":"INFO",
        "event_count":1,
        "ip":"172.27.31.8",
        "sdi_storm_worker_port":"6700",
        "type":"storm_worker_event",
        "sdi_streamline_root_ids":"[]",
        "seq_num":218897,
        "path":"/var/log/storm/workers-artifacts/streamline-1-test-3-1516755877/6700/events.log",
        "sdi_streamline_component_name":"KAFKA",
        "sdi_streamline_event_id":"899cf840-7cd9-42d2-bab4-0ba9e0b294e6",
        "sdi_streamline_parent_ids":"[]",
        "sdi_streamline_event_headers":"{}",
        "host":"ctr-e137-1514896590304-34122-01-000002.hwx.site",
        "sdi_streamline_topology_id":"1",
        "sdi_streamline_topology_name":"test",
        "id":"b1c2b45c-cba4-4f2e-9f8f-77722d592ad9",
        "sdi_streamline_event_fields_and_values":"{user_id=PUFPaY9KxDAcGqfsorJp3Q, review_id=Ya85v4eqdd6k9Od8HbQjyA, stars=4, date=2012-08-01, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}",
        "sdi_storm_topology_id":"streamline-1-test-3-1516755877",
        "logtime":"2018-01-24T20:05:01.866Z",
        "event_md5":"1516824301866-8905230003759095198",
        "logfile_line_number":1,
        "_ttl_":"+7DAYS",
        "_expire_at_":"2018-01-31T20:05:09.703Z",
        "_version_":1590505567372181504},
      {
        "cluster":"cl1",
        "level":"INFO",
        "event_count":1,
        "ip":"172.27.31.8",
        "sdi_storm_worker_port":"6700",
        "type":"storm_worker_event",
        "sdi_streamline_root_ids":"[]",
        "seq_num":218900,
        "path":"/var/log/storm/workers-artifacts/streamline-1-test-3-1516755877/6700/events.log",
        "sdi_streamline_component_name":"KAFKA",
        "sdi_streamline_event_id":"899cf840-7cd9-42d2-bab4-0ba9e0b294e6",
        "sdi_streamline_parent_ids":"[]",
        "sdi_streamline_event_headers":"{}",
        "host":"ctr-e137-1514896590304-34122-01-000002.hwx.site",
        "sdi_streamline_topology_id":"1",
        "sdi_streamline_topology_name":"test",
        "id":"d12cc99c-a79e-4f3d-95da-3165220f7002",
        "sdi_streamline_event_fields_and_values":"{user_id=PUFPaY9KxDAcGqfsorJp3Q, review_id=Ya85v4eqdd6k9Od8HbQjyA, stars=4, date=2012-08-01, business_id=5UmKMjUEUNdYWqANhGckJw, type=review, votes={funny=0, useful=0, cool=0}}",
        "sdi_storm_topology_id":"streamline-1-test-3-1516755877",
        "logtime":"2018-01-24T20:05:01.866Z",
        "event_md5":"1516824301866-8905230003759095198",
        "logfile_line_number":1,
        "_ttl_":"+7DAYS",
        "_expire_at_":"2018-01-31T20:05:09.703Z",
        "_version_":1590505567373230081},

HeartSaVioR commented 6 years ago

@arunmahadevan If there is only one line for such an event in events.log, the issue lies in logfeeder or Solr. Do you see this behavior for other events as well? I don't think we can prevent logfeeder from indexing an event multiple times (though more than 2~3 times indicates something is going wrong, and logfeeder may have a bug). We should deduplicate docs based on a unique ID, in this case the event ID (assuming we don't encounter a UUID collision).

Solr supports deduplication at indexing time (indexing becomes an upsert), but it depends on the schema. We are reusing the hadoop_logs schema, and the event ID field is an optional dynamic field, so I don't think we can apply it. We would need a separate collection and schema to get the full benefit of Solr, but then we would be outside the support boundary of the Ambari LogSearch service. https://lucene.apache.org/solr/guide/6_6/de-duplication.html

I may be able to deduplicate events in the result without deduplicating in Solr, but then a page can contain fewer docs than the page size, and duplicates that span pages will still show up (deduplication works only within a page). An alternative is to fetch more docs whenever I drop duplicates, but then paging becomes inaccurate and the start index in the query gets tricky to handle.

Which way do you want to go? Are we OK to live with SAM-side deduplication?
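For reference, the "fetch more" alternative can be sketched roughly as follows. This is not the SAM implementation: `fetch(start, rows)` is a hypothetical callable standing in for the Solr query, and the function returns the raw Solr offset at which to resume so the caller can use it as the next `start`.

```python
def fetch_unique_page(fetch, start, page_size, seen=None,
                      key="sdi_streamline_event_id"):
    """Collect up to page_size docs with distinct event IDs.

    fetch(start, rows) is a hypothetical callable returning a list of Solr docs.
    seen is an optional set of event IDs already shown on earlier pages.
    Returns (unique_docs, next_start), where next_start is the raw Solr offset
    to pass as start for the following page.
    """
    seen = set() if seen is None else seen
    unique = []
    offset = start
    while len(unique) < page_size:
        batch = fetch(offset, page_size)
        if not batch:
            break  # no more docs in the index
        for doc in batch:
            offset += 1  # count every raw doc consumed, duplicate or not
            if doc[key] not in seen:
                seen.add(doc[key])
                unique.append(doc)
                if len(unique) == page_size:
                    break
    return unique, offset


# Usage with an in-memory stand-in for Solr.
data = [{"sdi_streamline_event_id": e} for e in "aaabbcd"]
fake_fetch = lambda s, r: data[s:s + r]
seen_ids = set()
page1, next_start = fetch_unique_page(fake_fetch, 0, 2, seen_ids)
page2, next_start2 = fetch_unique_page(fake_fetch, next_start, 2, seen_ids)
```

The sketch makes the paging problem concrete: the page number no longer maps to a fixed `start`, and unless the set of already-seen IDs is carried between requests, a duplicate that straddles a page boundary shows up again on the next page.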

arunmahadevan commented 6 years ago

I think it may be a Solr/logfeeder issue. We can probably try to get Ambari to fix it. Yes, we could also add deduplication logic based on sdi_streamline_event_id, since we know these IDs are immutable and unique.

Query : http://ctr-e137-1514896590304-34122-01-000002.hwx.site:8886/solr/hadoop_logs/select?indent=on&wt=json&q=sdi_streamline_event_id:*+AND+(sdi_streamline_event_fields_and_values:stars%3D4,+OR+sdi_streamline_event_headers:stars%3D4,+OR+sdi_streamline_event_aux_fields_and_values:stars%3D4,)&fq=type:storm_worker_event&fq=sdi_streamline_topology_id:1&fq=logtime:[2018-01-24T19:36:04.194Z+TO+2018-01-24T20:06:04.194Z]&sort=logtime+asc&start=0&rows=25

Here is the response I receive: https://gist.github.com/arunmahadevan/859441b6d5f0b8554665e056d3130cfd
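Deduplication on sdi_streamline_event_id could possibly also be pushed into the query itself with Solr's collapse query parser, which keeps one doc per distinct field value. This is a different technique from anything tried in this thread, and the sketch rests on untested assumptions: that the dynamic event ID field is a single-valued type that the collapse parser accepts, and that duplicates of an event live on the same shard. The base URL is a placeholder.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_collapsed_query(base_url, topology_id, start=0, rows=25):
    """Build a select URL that collapses results on sdi_streamline_event_id."""
    params = [
        ("wt", "json"),
        ("q", "sdi_streamline_event_id:*"),
        ("fq", "type:storm_worker_event"),
        ("fq", f"sdi_streamline_topology_id:{topology_id}"),
        # Assumption: the field type is supported by the collapse parser.
        ("fq", "{!collapse field=sdi_streamline_event_id}"),
        ("sort", "logtime asc"),
        ("start", str(start)),
        ("rows", str(rows)),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Placeholder host; substitute the real Solr endpoint.
url = build_collapsed_query("http://localhost:8886/solr/hadoop_logs", 1)
print(parse_qs(urlsplit(url).query)["fq"])
```

If this works against the hadoop_logs collection, `start` and `rows` would apply to the collapsed result, which avoids the client-side paging problem. It would still only hide the duplicates rather than stop them from being indexed.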

hortonworks / streamline

Duplicate records while querying with event Id #1193