google / timesketch

Collaborative forensic timeline analysis
Apache License 2.0

Timesketch_importer duplicates jsonl events (imports twice) #2796

Closed boingomw closed 4 months ago

boingomw commented 1 year ago

Describe the bug: timesketch_importer imports twice when run on json_line files, resulting in duplicated events.

To reproduce:

  1. Create a sample timeline:
    log2timeline.py --storage_file example.plaso /usr/bin
  2. Verify how many events are in the file:
    pinfo.py example.plaso
  3. Import into Timesketch:
    timesketch_importer --host http://127.0.0.1:81 -u examiner -p xxxxxx --sketch_id 19 example.plaso
  4. Verify in the GUI that you have 1 import and the correct number of events.
  5. Convert the plaso file to json_line (normally this is done for a reason, such as slicing):
    psort.py -o json_line -w example.jsonl example.plaso
  6. Get a line count to verify the number of events hasn't changed:
    wc example.jsonl
  7. Import this into Timesketch:
    timesketch_importer --host http://127.0.0.1:81 -u examiner -p xxxxx --sketch_id 19 example.jsonl
  8. Check the GUI:
    example-jsonl 14.1K events (2 imports: details)
    example 7K events (imported with CLI importer tool)
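
As a cross-check for step 6, here is a minimal Python sketch that counts the non-empty lines in the psort output and flags exact duplicate records, so that duplicates already present in the source file can be ruled out before importing. The helper name count_jsonl_events is made up for illustration and is not part of plaso or Timesketch.

```python
import json
from collections import Counter


def count_jsonl_events(path):
    """Return (total_events, duplicate_events) for a JSON-lines file.

    Hypothetical helper for illustration only.
    """
    total = 0
    seen = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            total += 1
            # Canonicalize key order so identical events compare equal.
            seen[json.dumps(json.loads(line), sort_keys=True)] += 1
    duplicates = sum(count - 1 for count in seen.values() if count > 1)
    return total, duplicates


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        events, dupes = count_jsonl_events(sys.argv[1])
        print(f"{events} events, {dupes} exact duplicates")
```

If this reports zero duplicates and the GUI still shows twice the event count, the doubling happens during import rather than in the file.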

Expected behavior: the importer should not import the events twice.


Desktop (please complete the following information):

Latest Docker install, as of 6/15/2023

S-Nicholas commented 1 year ago

Same issue as https://github.com/google/timesketch/issues/2334

berggren commented 1 year ago

@boingomw I see that you have 2 imports reported. This indicates that you ran the importer twice with the same timeline name. That will put the events in the same timeline.

Can you confirm that this isn't the case? I can't reproduce this issue on my end.

boingomw commented 1 year ago

I did run it twice: once for the .jsonl file and once for the .plaso file. The issue is that both files had the same number of events in them, but when I imported the .jsonl file, the timeline ended up with 2x the number of events.

So when you follow the steps above, do you end up with 7K events for the .jsonl file and 7K for the .plaso file?

arisjr commented 1 year ago

I can confirm that timesketch_importer is also creating doubled sources for JSONL imports here, which doubles the events in searches. The same does not apply to web imports, which seem to import correctly.

# timesketch_importer --version
API Client Version: 20230721
Importer Client Version: 20230721

If you need a sample JSONL file, I can supply one.

Regards,

jaegeral commented 1 year ago

I just tried it with the CLI tool, timesketch --sketch 1 import /usr/local/src/timesketch/temp/sigma_temp.jsonl, with the following file content (https://github.com/google/timesketch/blob/master/test_tools/test_events/sigma_events.jsonl):

{"message": "A message","timestamp": 123456789,"datetime": "2015-07-24T19:01:01+00:00","timestamp_desc": "Write time","extra_field_1": "foo"}
{"message": "Another message","timestamp": 123456790,"datetime": "2015-07-24T19:01:02+00:00","timestamp_desc": "Write time","extra_field_1": "bar"}
{"message": "Yet more messages","timestamp": 123456791,"datetime": "2015-07-24T19:01:03+00:00","timestamp_desc": "Write time","extra_field_1": "baz"}
{"message": "Install: zmap:amd64 (1.1.0-1) [Commandline: apt-get install zmap]","timestamp": 123456791,"datetime": "2015-07-24T19:01:03+00:00","timestamp_desc": "foo","command":"Commandline: apt-get install zmap","data_type":"apt:history:line","display_name":"GZIP:/var/log/apt/history.log.1.gz","filename":"/var/log/apt/history.log.1.gz","packages":"Install: zmap:amd64 (1.1.0-1)","parser":"apt_history"}
{"message": "[11 / 0x000b] Source Name: Microsoft-Windows-Sysmon Strings: ['DLL', '2022-01-22 23:07:43.492', '{C784477D-8DE8-61EC-AAAA-000000003C00}', '7812', 'C:\\Windows\\tifubjdl\\lysjbpb.exe', 'C:\\Windows\\itfnduuui\\Corporate\\mimilib.dll', '2022-01-22 23:07:43.492'] Computer Name: DESKTOP-B0TAAAA Record Number: 913 Event Level: 4","computer_name":"DESKTOP-B0TAAAA","data_type":"windows:evtx:record","datetime":"2022-01-22T23:07:43.502205+00:00","display_name":"OS:/data/input/Microsoft-Windows-Sysmon%4Operational.evtx","event_identifier":"11","event_level":"4","message_identifier":"11","parser":"winevtx","source_name":"Microsoft-Windows-Sysmon","timestamp":"1642892863502205","timestamp_desc":"Creation Time" }

And I got a new timeline with 5 events.

boingomw commented 1 year ago

Maybe it's volume related and 5 lines aren't enough to trigger it.

arisjr commented 1 year ago

sample.zip

@jaegeral, try this (password: sample123). It has 484 events, but they double up on import.

Regards

arisjr commented 1 year ago

@jaegeral I just realized that you're using the timesketch CLI instead of timesketch-import-client (timesketch_importer). Is there any difference between the two approaches?

jaegeral commented 1 year ago

Hm indeed, it is importing them twice.

jaegeral commented 11 months ago

FWIW, I am still working on this. It seems my e2e tests in https://github.com/google/timesketch/pull/2976 do not trigger it.

mari0d commented 4 months ago

Still seeing this bug in the latest version of TS. Looking at the code, this flush call isn't needed, since the stream's close method already calls flush(). We are seeing duplicates because flush() is called twice, and the _data_lines buffer isn't cleared directly by flush() (which makes the method name a bit misleading).
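
To illustrate the mechanism described above, here is a minimal, self-contained Python sketch. It is not the Timesketch importer code: BufferedStreamer, add_event, and import_events are hypothetical names, and only the _data_lines buffer name is taken from the comment. It assumes a flush() that uploads the buffer without clearing it, and a close() that calls flush() itself.

```python
class BufferedStreamer:
    """Hypothetical stand-in for a buffered importer (not the real class)."""

    def __init__(self):
        self._data_lines = []  # events waiting to be uploaded
        self.uploaded = []     # stands in for the server-side timeline

    def add_event(self, event):
        self._data_lines.append(event)

    def flush(self):
        # Assumed behavior: upload the buffer without clearing it.
        self.uploaded.extend(self._data_lines)

    def close(self):
        # close() flushes on its own, then drops the buffer.
        self.flush()
        self._data_lines = []


def import_events(events, extra_flush):
    """Upload events, optionally with a redundant flush() before close()."""
    streamer = BufferedStreamer()
    for event in events:
        streamer.add_event(event)
    if extra_flush:
        streamer.flush()  # redundant call: close() will flush again
    streamer.close()
    return streamer.uploaded


events = [{"message": "a"}, {"message": "b"}, {"message": "c"}]
print(len(import_events(events, extra_flush=True)))   # 6: every event twice
print(len(import_events(events, extra_flush=False)))  # 3: each event once
```

Under these assumptions, either removing the extra flush() call or clearing the buffer inside flush() avoids the duplicate upload.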