opensemanticsearch / open-semantic-etl

Python-based Open Source ETL tools for file crawling, document processing (text extraction, OCR), content analysis (Entity Extraction & Named Entity Recognition) & data enrichment (annotation) pipelines & ingestor to Solr or Elasticsearch index & linked data graph database
https://opensemanticsearch.org/etl
GNU General Public License v3.0

Indexing file system visible via nginx? #107

Closed NetwarSystem closed 4 years ago

NetwarSystem commented 4 years ago

I have an Open Semantic Search VM running at vross.netwarsystem.com and I'm trying to index the contents of vrfiles.netwarsystem.com. The content of this site is just a directory tree that is exposed with nginx. When I add the site to OSS, it reports visiting each subdirectory, but it doesn't pick up the PDF files IN the subdirectory.

If I manually provide the entire URL to a file, it seems to run as far as the command line is concerned, but I get no indication that the contents have been added.

https://gist.github.com/NetwarSystem/ffd9c99d8d10e402058186952b89cb84

These are not things that outsiders can see due to Cloudflare Access controls, but I'm sure that the OSS instance can see the vrfiles host. I'm at a loss for what to try next here.

Mandalka commented 4 years ago

So you are "only" missing the PDF files in the search index, but the web pages, like the index page with links to these PDFs, are indexed/searchable?

Or are even these pages not indexed? (I am working on more status info in the UI and on documentation for how to access the logs in the next few days.)

The default settings of Scrapy, the framework used for web crawling, do not download binary files like PDFs, just web pages.

I'll add an option to change this behaviour, so you can optionally whitelist more file extensions for download.
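
For illustration, a minimal Scrapy sketch of what such a whitelist amounts to: the `LinkExtractor` parameter `deny_extensions` defaults to `scrapy.linkextractors.IGNORED_EXTENSIONS`, which includes `pdf`. The spider name, start URL and callback below are placeholders, not part of the Open Semantic Search connector.

```python
# Illustrative only; not the Open Semantic Search web connector itself.
# Scrapy's LinkExtractor skips extensions listed in IGNORED_EXTENSIONS
# (which contains 'pdf'), so allowing PDFs means removing 'pdf' from the deny list.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor, IGNORED_EXTENSIONS

# keep the default deny list, but let PDF links be followed and downloaded
DENY_WITHOUT_PDF = [ext for ext in IGNORED_EXTENSIONS if ext != 'pdf']

class DirectoryListingSpider(CrawlSpider):
    # placeholder spider and start URL, for illustration only
    name = 'directory_listing_example'
    start_urls = ['https://vrfiles.netwarsystem.com/']

    rules = (
        Rule(LinkExtractor(deny_extensions=DENY_WITHOUT_PDF),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # hand the fetched URL and content type on to whatever does the indexing
        yield {
            'url': response.url,
            'content_type': response.headers.get('Content-Type', b'').decode(),
        }
```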

Mandalka commented 4 years ago

I removed the filters for PDF and office files from the default settings.

But in the new default config I did not remove the filters for other binaries like images, archives and videos, since when crawling external web sites there are sometimes gigabytes of linked binaries and traffic is expensive.

You can do that yourself: as of today's new deb package there is a config file /etc/opensemanticsearch/connector-web where you can see, set or deactivate these denied file extensions / filters for crawling.

But for now it's not a good idea to recrawl filesystems with many binaries using the web crawler: there is no "file not modified" filter for the web crawler yet, so all files will be fully reprocessed on every recrawl.

If possible, you can mount the directory via a Linux network filesystem and map its IDs to http in the ETL config, so users get the right links but the ETL can use the file system crawler, which does not reprocess files that have not changed.
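
A minimal sketch of that mapping idea follows; the mount point, nginx base URL and function name are assumptions for illustration, not the actual ETL option names, so check the config files under /etc/opensemanticsearch/ for the real setting.

```python
# Sketch of the file-path-to-URL mapping idea only; the actual option names
# in the ETL config will differ.
MOUNT_POINT = '/mnt/vrfiles'                      # assumed local mount of the network share
PUBLIC_BASE = 'https://vrfiles.netwarsystem.com'  # assumed nginx URL serving the same tree

def map_file_uri_to_http(file_uri: str) -> str:
    """Rewrite a file:// URI from the filesystem crawler into the
    corresponding http(s):// link that nginx serves to users."""
    path = file_uri[len('file://'):] if file_uri.startswith('file://') else file_uri
    if path.startswith(MOUNT_POINT):
        return PUBLIC_BASE + path[len(MOUNT_POINT):]
    return file_uri

# e.g. file:///mnt/vrfiles/reports/2020/q1.pdf
#   -> https://vrfiles.netwarsystem.com/reports/2020/q1.pdf
```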

NetwarSystem commented 4 years ago

OK - that is exactly the problem I am having.

The default of not fetching PDFs and other binaries is the right thing for web sites. But it would be good to be able to whitelist sites from which we want to get everything.

Our current setup is built to integrate OSS, Atlassian tools, and an nginx instance that provides access to a directory tree. We want to digest the content with OSS, create reports on it using Atlassian's Confluence (wiki), and then have the content served by nginx. The entire system is behind Cloudflare and protected with Cloudflare Access - our analysts are all remote from the site where the servers are located.

The OSS system can access the storage behind the nginx file system - you are saying that we can do intake there, but then replace the file:// with http://? Where can I read up on how to do this?

NetwarSystem commented 4 years ago

I dug through configuration files and found how to do this - thanks for the tip.

It really should be written up somewhere, though ... it was a real headache for me, and I can't be the only one.