Add ability to search folder without repository

dobrou commented 5 years ago

Feature request description

Add the ability to search folders and files without a repository: mount any local folder into the Sourcegraph Docker container and configure Sourcegraph to treat that folder as a repository and index it. Alternatively, configure a folder on a remote Windows shared drive as the repository.

This would lack features such as history and code intelligence. However, even simple full-text search in Sourcegraph would provide great value.

Is your feature request related to a problem? If so, please describe.

I have a folder full of text data such as logs. I would like to use Sourcegraph's quick and efficient search on the files in this folder.

Describe alternatives you've considered.

Commit the data to a Git repository and configure Sourcegraph to scan that repository. However data are too big (50GB+) and updated every day, so git may not handle this well.

unknwon commented 5 years ago

cc @christinaforney since I think this is more about product decisions.

dadlerj commented 5 years ago

Hi @dobrou! This will soon be possible with a new Sourcegraph tool called src-expose. See the pull request at https://github.com/sourcegraph/sourcegraph/pull/5835/ for context.

src-expose is a tool to periodically snapshot local directories and serve them as Git repositories over HTTP. This is a useful way to get code from other version control systems, or textual artifacts from non-version-controlled systems (e.g. configuration), into Sourcegraph.

Does this sound like it would help?
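
For readers unfamiliar with the idea, the snapshot step described above can be sketched roughly as follows. This is not src-expose's implementation; it only illustrates "periodically commit a directory into a Git repository" using standard Git commands. The function names, the commit identity, and the interval are assumptions made for illustration.

```python
from __future__ import annotations

import subprocess
import time
from pathlib import Path

# Hypothetical identity so that `git commit` works without any global Git config.
IDENTITY = ["-c", "user.name=snapshot", "-c", "user.email=snapshot@localhost"]


def snapshot_commands(message: str) -> list[list[str]]:
    """Return the Git commands that record a directory's current state as one commit."""
    return [
        # Stage additions, modifications, and deletions.
        ["git", "add", "--all"],
        # --allow-empty avoids a failure when nothing changed since the last snapshot.
        ["git", *IDENTITY, "commit", "--quiet", "--allow-empty", "-m", message],
        # Refresh the metadata used by Git's "dumb" HTTP transport.
        ["git", "update-server-info"],
    ]


def snapshot(directory: Path) -> None:
    """Initialise a repository in `directory` if needed, then commit a snapshot."""
    if not (directory / ".git").exists():
        subprocess.run(["git", "init", "--quiet"], cwd=directory, check=True)
    message = time.strftime("snapshot %Y-%m-%d %H:%M:%S")
    for command in snapshot_commands(message):
        subprocess.run(command, cwd=directory, check=True)


def snapshot_forever(directory: Path, interval_seconds: float = 600.0) -> None:
    """Take a snapshot every `interval_seconds` until interrupted."""
    while True:
        snapshot(directory)
        time.sleep(interval_seconds)
```

Calling `snapshot_forever(Path("/data/logs"))` (a made-up path) would commit the folder every ten minutes. Serving the resulting repository over HTTP is deliberately left out of this sketch.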

unknwon commented 5 years ago

@dadlerj I totally missed that we are going to publish src-expose soon!

But I think @dobrou has already raised a concern about using Git:

> However data are too big (50GB+) and updated every day, so git may not handle this well.

dobrou commented 5 years ago

Hi, thank you for the quick and useful response.

The src-expose documentation sounds like it should work. My only concern is performance:

- Speed: in my case, 50GB in 1,000,000 text files.
- Size: it seems Sourcegraph will not scan files in place, but will create another copy through src-expose, which doubles disk usage. It would be great if src-expose's architecture could take this into account.

I will try the insiders build and check how it behaves.

Thanks again.

keegancsmith commented 5 years ago

Are you indexing source code? 50GB is quite large for 1 million files. I assume the distribution of file sizes follows some sort of power law and that the 50GB is dominated by a few very large files? Note: Git will do fine with 1 million text files, especially if they aren't updated often. Additionally, src-expose allows you to shard the data across a few Git repos (by subdirectory).
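
To judge whether sharding by subdirectory would split the data evenly, one option is to total the bytes under each top-level subdirectory first. Below is a rough sketch that is independent of src-expose itself: the function name is made up, and how src-expose is actually configured for sharding is not shown here.

```python
import os
from collections import Counter
from pathlib import Path


def bytes_per_subdir(root: Path) -> Counter:
    """Sum file sizes under each top-level subdirectory of `root`.

    Files that sit directly in `root` are counted under the key ".".
    """
    totals: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        relative = Path(dirpath).relative_to(root)
        # The first path component identifies the candidate shard.
        shard = relative.parts[0] if relative.parts else "."
        for name in filenames:
            # lstat does not follow symlinks, so broken links do not raise.
            totals[shard] += os.lstat(os.path.join(dirpath, name)).st_size
    return totals
```

`bytes_per_subdir(Path("/data/logs")).most_common()` (with a made-up path) lists the largest subdirectories first, which gives a quick view of how uneven the shards would be.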

Sourcegraph will always create quite a few copies of your data anyway, so if hard drive space is a concern, that will be an issue. For example, clones will be kept by both src-expose and gitserver. Our indexing system will then create indexes that are bigger than the working copies, and some other systems will cache working copies.

Just for context, source code should be quite a bit smaller in general. For example, here are some stats for the Go code in our main repo (not as many files):

```shellsession
sourcegraph on core/gitserver-ping [?] ❯ find . -name '*.go' | xargs stat -f%z | histogram.py
# NumSamples = 999; Min = 13.00; Max = 427143.00
# Mean = 4877.799800; Variance = 229551931.207167; SD = 15150.971296; Median 2411.000000
# each ∎ represents a count of 13
    13.0000 -  42726.0000 [ 995]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 42726.0000 -  85439.0000 [   2]:
 85439.0000 - 128152.0000 [   0]:
128152.0000 - 170865.0000 [   1]:
170865.0000 - 213578.0000 [   0]:
213578.0000 - 256291.0000 [   0]:
256291.0000 - 299004.0000 [   0]:
299004.0000 - 341717.0000 [   0]:
341717.0000 - 384430.0000 [   0]:
384430.0000 - 427143.0000 [   1]:

sourcegraph on core/gitserver-ping [?] ❯ find . -name '*.go' | xargs stat -f%z | histogram.py --min 0 --max $((50 * 1024))
# NumSamples = 999; Min = 0.00; Max = 51200.00
# 2 values outside of min/max
# Mean = 4877.799800; Variance = 229551931.207167; SD = 15150.971296; Median 2411.000000
# each ∎ represents a count of 9
    0.0000 -  5120.0000 [ 725]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 5120.0000 - 10240.0000 [ 165]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
10240.0000 - 15360.0000 [  58]: ∎∎∎∎∎∎
15360.0000 - 20480.0000 [  23]: ∎∎
20480.0000 - 25600.0000 [  15]: ∎
25600.0000 - 30720.0000 [   7]:
30720.0000 - 35840.0000 [   2]:
35840.0000 - 40960.0000 [   0]:
40960.0000 - 46080.0000 [   0]:
40960.0000 - 46080.0000 [   2]:
46080.0000 - 51200.0000 [   0]:
```
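
For anyone who wants the same kind of numbers for their own folder without `histogram.py`, a rough Python equivalent might look like the sketch below. It is not how `histogram.py` works internally; the function names and the equal-width bucketing are assumptions. It also spells out the comparison being made: 50GB (taken here as 50 × 10^9 bytes) over 1,000,000 files is an average of 50,000 bytes per file, roughly ten times the 4877.8-byte mean reported above for the Go files.

```python
import os


def file_sizes(root, suffix=""):
    """Collect sizes in bytes of all files under `root` whose name ends with `suffix`."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                sizes.append(os.lstat(os.path.join(dirpath, name)).st_size)
    return sizes


def size_histogram(sizes, buckets=10):
    """Split sizes into `buckets` equal-width ranges; return (low, high, count) tuples."""
    if not sizes:
        return []
    low, high = min(sizes), max(sizes)
    width = (high - low) / buckets
    counts = [0] * buckets
    for size in sizes:
        # The maximum value falls into the last bucket rather than a new one.
        index = min(int((size - low) / width), buckets - 1) if width else 0
        counts[index] += 1
    return [(low + i * width, low + (i + 1) * width, c) for i, c in enumerate(counts)]


# Worked comparison using the figures quoted in this thread.
AVERAGE_LOG_FILE_BYTES = 50 * 10**9 / 1_000_000  # 50GB over 1 million files
MEAN_GO_FILE_BYTES = 4877.7998                   # "Mean" from the output above
RATIO = AVERAGE_LOG_FILE_BYTES / MEAN_GO_FILE_BYTES
```

`size_histogram(file_sizes(".", ".go"))` gives the bucket counts for Go files under the current directory.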

dobrou commented 5 years ago

Hi @keegancsmith, the files are mostly log files from various sources.

I know there are better, specialized solutions for log handling, and I understand this is not your primary use case.

The idea is just that Sourcegraph is great at full-text search (among many other things), so it looked like a solution that could solve my problem and is easy to set up and maintain.
