simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Hang: `-R` with 10,000 files and 20 threads on MacBook Pro #400

Closed by simsong 5 months ago

simsong commented 1 year ago

Something is wrong with the -R file iterator. Redesign so that it:

Likely related to #396

zdavatz commented 7 months ago

Any news here?

simsong commented 7 months ago

Thanks for asking. This has been really causing me a lot of psychic pain but I just haven't gotten to it. If you have a student who can look at this, I can supervise. Otherwise it will need to wait until I finish the book that I'm currently working on, which has to be at the publisher in a few weeks.

simsong commented 5 months ago

@zdavatz - I think that Release 2.1.0 may solve your problem. Can you look and see if your original hang specified a regular expression? Perhaps give me a way to reproduce it?

zdavatz commented 5 months ago

Great, thank you 🙏 !

Download a website with wget -r and then run bulk_extractor on the full directory.

simsong commented 5 months ago

Can you recommend a specific website and then make the archive available? Ideally a U.S. government website, so there is no copyright issue. Thanks.


zdavatz commented 5 months ago

Sure, try with wget -r https://www.zuerich.ch/content/zh/de/index.html
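
Since the issue title mentions 10,000 files, it can help to check how many files a `wget -r` download actually produced before handing the directory to `-R`. A minimal sketch, assuming Python is available; `count_files` is a hypothetical helper and not part of bulk_extractor:

```python
import os
import tempfile


def count_files(root):
    """Count non-directory entries under root, recursing into subdirectories."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total


if __name__ == "__main__":
    # Build a tiny throwaway tree so the sketch runs anywhere;
    # in practice, pass the directory created by wget instead.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "a", "b"))
        for rel in ("index.html", "a/page.html", "a/b/img.png"):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("x")
        print(count_files(root))
```

Calling `count_files("www.zuerich.ch")` on the downloaded tree gives the number to compare against the file count in the original report.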

zdavatz commented 5 months ago

Or you just do a new release and I test the release.

simsong commented 5 months ago

The release is released.


zdavatz commented 5 months ago

Great! Can I already update via Kali Linux?

zdavatz commented 5 months ago

Kali shows me the version: 2.0.6-0kali1

zdavatz commented 5 months ago

Can I grab a binary somewhere?

simsong commented 5 months ago

I don’t maintain Linux distros. You can download the source code and compile it! It’s not hard.


zdavatz commented 5 months ago

On my Gentoo Linux system, I get the following compile errors when running make.

  239 |     const std::filesystem::path get_input_fname() const;
      |                ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: 'filesystem' is not a member of 'std'
  243 |     const std::vector<std::filesystem::path> &find_files()    const { return sc.find_files(); }
      |                            ^~~~~~~~~~
be20_api/scanner_set.h:243:28: error: 'filesystem' is not a member of 'std'
be20_api/scanner_set.h:243:44: error: template argument 1 is invalid
  243 |     const std::vector<std::filesystem::path> &find_files()    const { return sc.find_files(); }
      |                                            ^
be20_api/scanner_set.h:243:44: error: template argument 2 is invalid
In file included from be20_api/path_printer.cpp:5:
be20_api/scanner_set.h: In member function 'scanner_set::stats scanner_set::stats::operator+(const scanner_set::stats&)':
be20_api/scanner_set.h:95:64: error: no matching function for call to 'scanner_set::stats::stats(scanner_set::stats)'
   95 |             return stats(this->ns + s.ns, this->calls + s.calls);
      |                                                                ^
make[2]: *** [Makefile:1520: be20_api/feature_recorder_file.o] Error 1
be20_api/path_printer.cpp: In member function 'void path_printer::process_http(std::istream&)':
be20_api/path_printer.cpp:319:52: error: 'class abstract_image_reader' has no member named 'image_fname'
  319 |             out << "X-Image-Filename: " << reader->image_fname() << PrintOptions::HTTP_EOL;
      |                                                    ^~~~~~~~~~~
make[2]: *** [Makefile:1520: be20_api/feature_recorder.o] Error 1
make[2]: *** [Makefile:1520: be20_api/feature_recorder_set.o] Error 1
make[2]: *** [Makefile:1520: be20_api/path_printer.o] Error 1
make[2]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0/src'
make[1]: *** [Makefile:525: all-recursive] Error 1
make[1]: Leaving directory '/home/zeno/.software/bulk_extractor-2.1.0'
make: *** [Makefile:465: all] Error 2

simsong commented 5 months ago

Please upload your config.log. Did you prep a clean VM using one of the prep scripts in the etc directory? What version of Linux are you using? Which compiler? It looks like your compiler doesn’t support C++17 properly.


simsong commented 5 months ago

Well, I just ran

wget -r https://www.zuerich.ch/content/zh/de/index.html
src/bulk_extractor -o zuerich-out -R www.zuerich.ch/

Here is the email histogram:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.1.0
# Feature-Recorder: email
# Filename: www.zuerich.ch
# Histogram-File-Version: 1.1
n=39    information@zuerich.ch
n=4     hotel@zuerich.com
n=4     mail@dominiquemeienberg.ch
n=4     media@zuerich.com
n=2     anna.schindler@zuerich.ch
n=2     groups@zuerich.com
n=2     info@zuerich.com
n=1     foto@umeisser.ch

There are only 170 files, and processing took less than a second. The report.xml file is attached: report.xml.txt
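
To compare histograms between runs or releases, the output above can be read programmatically. A minimal sketch, assuming the layout shown in this comment (header lines starting with `#`, then `n=<count>` followed by whitespace and the feature); `parse_histogram` and `SAMPLE` are hypothetical names, not part of bulk_extractor:

```python
# A few lines copied from the email histogram in this thread.
SAMPLE = """\
# BULK_EXTRACTOR-Version: 2.1.0
# Feature-Recorder: email
n=39\tinformation@zuerich.ch
n=4\thotel@zuerich.com
n=4\tmail@dominiquemeienberg.ch
n=4\tmedia@zuerich.com
n=2\tanna.schindler@zuerich.ch
n=2\tgroups@zuerich.com
n=2\tinfo@zuerich.com
n=1\tfoto@umeisser.ch
"""


def parse_histogram(text):
    """Return {feature: count} from histogram text in the layout shown above."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines.
        if not line or line.startswith("#"):
            continue
        # Split once on whitespace: 'n=39' and the feature string.
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].startswith("n="):
            continue
        counts[parts[1]] = int(parts[0][2:])
    return counts


if __name__ == "__main__":
    histogram = parse_histogram(SAMPLE)
    for feature, count in sorted(histogram.items(), key=lambda kv: -kv[1]):
        print(count, feature)
```

Splitting on any whitespace rather than on a literal tab keeps the sketch tolerant of copies where the separator was converted to spaces.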

zdavatz commented 5 months ago

great, thank you!