fulltextsearch:index hangs, repeating "- RAM: 566.something"

b- commented 6 years ago

I'm trying to perform my first index, and I'm almost definitely stress testing this whole FullTextSearch stack while doing so — I've got a 500 GB dataset.

The data itself is "external storage," which is CIFS storage served from Windows Server 2012 R2, mounted via cifs entries in /etc/fstab as follows:

## Windows Shares
//<server>/Q-Data/shared    /NetShares/v    cifs    credentials=/NetShares/.smbcredentials,iocharset=utf8,sec=ntlm,file_mode=0777,dir_mode=0777 0   0
//<server>/R-Data       /NetShares/r    cifs    credentials=/NetShares/.smbcredentials,iocharset=utf8,sec=ntlm,file_mode=0777,dir_mode=0777 0   0
//<server>/J-Data       /NetShares/j    cifs    credentials=/NetShares/.smbcredentials,iocharset=utf8,sec=ntlm,file_mode=0777,dir_mode=0777 0   0
//<server>/W-Data       /NetShares/w    cifs    credentials=/NetShares/.smbcredentials,iocharset=utf8,sec=ntlm,file_mode=0777,dir_mode=0777 0   0

I've gotten it to index for one user successfully, but then as soon as it gets to the next user it gets stuck filling the RAM (?) once it gets to - RAM: 566.43410491943.

If I try to continue indexing by simply doing occ fulltextsearch:index again this happens:

root@c:/var/ncdata# occ fulltextsearch:index
indexing Files.
 USER: blr@ieitransit.com
- RAM: 23.89949798584
- RAM: 30.564025878906
- RAM: 35.104431152344
- RAM: 37.639022827148
- RAM: 41.782417297363
- RAM: 46.028549194336
- RAM: 49.148544311523
- RAM: 50.502540588379
- RAM: 54.82398223877
- RAM: 59.080146789551
- RAM: 60.91951751709
- RAM: 65.728569030762
- RAM: 66.85343170166
- RAM: 70.643966674805
- RAM: 73.079170227051
- RAM: 77.160293579102
- RAM: 80.570121765137
- RAM: 83.654678344727
- RAM: 91.375785827637
- RAM: 96.923522949219
- RAM: 102.13726043701
- RAM: 110.47479248047
- RAM: 113.87605285645
- RAM: 116.39376831055
- RAM: 119.73052215576
- RAM: 123.79248809814
- RAM: 128.93239593506
- RAM: 132.1827545166
- RAM: 135.75133514404
- RAM: 141.4944152832
- RAM: 149.27917480469
- RAM: 159.47479248047
- RAM: 164.93907928467
- RAM: 169.28715515137
- RAM: 173.63157653809
- RAM: 177.92807769775
- RAM: 181.32400512695
- RAM: 184.85238647461
- RAM: 188.70569610596
- RAM: 192.77071380615
- RAM: 196.81725311279
- RAM: 204.74087524414
- RAM: 207.80700683594
- RAM: 211.59580993652
- RAM: 486.29839324951
- RAM: 508.19902038574
- RAM: 277.52561950684
- RAM: 272.46606445312
- RAM: 279.18186187744
- RAM: 287.55152893066
- RAM: 285.63525390625
- RAM: 291.30284118652
- RAM: 301.78762817383
- RAM: 306.75387573242
- RAM: 310.88928222656
- RAM: 316.43890380859
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568
- RAM: 566.42482757568

and then it'll keep repeating that last number and going nowhere.

The only relevant entry in the log file is:

{
  "reqId": "M8HADbAjjBympXqBfFxw",
  "level": 1,
  "time": "2018-05-08T11:53:44-04:00",
  "remoteAddr": "",
  "user": "--",
  "app": "admin_audit",
  "method": "--",
  "url": "--",
  "message": "Console command executed: fulltextsearch:index",
  "userAgent": "--",
  "version": "13.0.2.1"
}
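
For anyone hunting for more than the audit entry, here is a minimal sketch of filtering the log by severity. It assumes the log file holds one JSON object per line and that a higher `level` means a more severe message. The helper name `entries_at_or_above` and the path in the usage comment are placeholders, not part of Nextcloud.

```python
import json

def entries_at_or_above(lines, min_level=2):
    """Yield parsed log entries whose "level" is at least min_level.

    Assumes one JSON object per line; lines that do not parse are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON line, ignore it
        if isinstance(entry, dict) and entry.get("level", 0) >= min_level:
            yield entry

# Hypothetical usage; the log path depends on the installation.
# with open("/path/to/nextcloud.log", encoding="utf-8") as fh:
#     for e in entries_at_or_above(fh, min_level=2):
#         print(e["time"], e["app"], e["message"])
```

With `min_level=2` the audit entry above (level 1) would be filtered out, which makes it easier to see whether anything else was logged during the hang.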

I've reset the entire index more than once and started over, but it always ends up hanging, repeating the same - RAM: line over and over. I let it sit for over 24 hours doing so, just to be certain…

I imagine this number itself isn't particularly significant (I have no clue what it even refers to), but nevertheless I don't know where to start troubleshooting more.

I want to reference this issue I'm having, https://github.com/nextcloud/fulltextsearch_elasticsearch/issues/22 because I believe they may be related.

Lastly, while it does sound to me like this may be something to do with resource limits, I have definitely given the VM running Nextcloud / FullTextSearch / ElasticSearch, etc. enough resources: 32 GB RAM and 8 Xeon E5 CPU cores. If it's as simple as telling something it's allowed to use more RAM, so be it.

b- commented 6 years ago

Output from sudo -u www-data php -r 'phpinfo();' https://pastebin.com/mFaAGE45

ArtificialOwl commented 6 years ago

fulltextsearch 0.7 will be released within the next few days with some fixes for remote filesystems.

b- commented 6 years ago

Awesome! I can't wait to try it out!

b- commented 6 years ago

Not fixed :(

I upgraded to 0.7 (and then 0.7.1), reset the database, and did an index.

I'll leave it going overnight just in case, but it doesn't seem to be going anywhere.

blr@c:/var/www$ sudo -u www-data php occ fulltextsearch:index
[sudo] password for blr:
Could not open input file: occ
blr@c:/var/www$ sudo -u www-data php nextcloud/occ fulltextsearch:index
indexing Files.
 USER: blr@ieitransit.com
- RAM: 23.953216552734
- RAM: 27.58837890625
- RAM: 29.759056091309
- RAM: 33.32982635498
- RAM: 37.234443664551
- RAM: 40.800727844238
- RAM: 43.178802490234
- RAM: 46.336280822754
- RAM: 50.180084228516
- RAM: 53.205558776855
- RAM: 58.456916809082
- RAM: 61.452491760254
- RAM: 65.36141204834
- RAM: 68.055366516113
- RAM: 73.727157592773
- RAM: 77.026557922363
- RAM: 79.525146484375
- RAM: 84.755867004395
- RAM: 87.089782714844
- RAM: 89.403251647949

(snipped out a bunch more RAM lines as it slowly ramped up)

- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521
- RAM: 534.969871521

b- commented 6 years ago

Wait, sorry, I left it overnight and it definitely is indexing now. I'll let it keep going. I have it open in a tmux session so I can disconnect and reconnect at will.

ArtificialOwl commented 6 years ago

I am curious: can you tell me the last value you can read on the - RAM line while indexing?

b- commented 6 years ago

I should have piped it to tee, and I didn't, so I actually don't have that handy before it started indexing.

If there's a way to allow the first index to utilize more RAM, I gave this VM 32 GB…

EDIT: and just to be clear, I'm indexing a 500 GB dataset that's mostly the types of files I'd expect FullTextSearch-Files to be able to comprehend — pdf, doc/docx, xls/xlsx, a few ppt, and so on. So I expect it to take long to do a full index, and if there are any issues with handling large datasets I'm sure I'll make them crop up.

b- commented 6 years ago

I just went to check on things, and it was slowly moving along but then I realized my swap partition was full. I just added an extra 32 GB swap drive (which I couldn't do without shutting down the VM as the controller doesn't support hot-swappable drives), and it's going again.
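
A full swap partition is easy to miss during a multi-day run. The following is a small sketch for checking it periodically, assuming a Linux host where `/proc/meminfo` reports `SwapTotal` and `SwapFree` in kB. The function name `swap_usage_kb` is made up for illustration.

```python
def swap_usage_kb(meminfo_text):
    """Return (total_kb, used_kb) parsed from /proc/meminfo-style text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in ("SwapTotal", "SwapFree"):
            values[key] = int(rest.split()[0])  # first token is the number
    total = values.get("SwapTotal", 0)
    free = values.get("SwapFree", 0)
    return total, total - free

# Linux-only usage:
# with open("/proc/meminfo") as fh:
#     total, used = swap_usage_kb(fh.read())
#     print(f"swap used: {used} of {total} kB")
```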

This time I piped fulltextsearch:index to tee so it'll save the progress to a file, "index.log"

I'll let you know tomorrow what "- RAM: " line it gets to before it gets to the files.
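
Once the output is saved in index.log, the value in question can be pulled out without scrolling through the file. This is a rough sketch that only assumes the "- RAM: <number>" line format shown in the transcripts above; `summarize_ram` is a hypothetical helper, and it makes no claim about what unit the number is in.

```python
import re

# Matches the numeric part of lines such as "- RAM: 566.42482757568".
RAM_LINE = re.compile(r"- RAM: ([0-9]+(?:\.[0-9]+)?)")

def summarize_ram(lines):
    """Return (last_value, peak_value, trailing_repeats) for '- RAM:' lines.

    trailing_repeats counts how often the final value repeats at the end.
    Returns None if no RAM lines are found.
    """
    values = [float(m.group(1)) for line in lines for m in RAM_LINE.finditer(line)]
    if not values:
        return None
    last = values[-1]
    repeats = 0
    for v in reversed(values):
        if v != last:
            break
        repeats += 1
    return last, max(values), repeats

# Usage against the saved output:
# with open("index.log", encoding="utf-8", errors="replace") as fh:
#     print(summarize_ram(fh))
```

A large `trailing_repeats` with an unchanged last value corresponds to the "stuck" phase described in this issue.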

As far as I understand, it should support resuming the first index, right? I didn't do a fulltextsearch:reset after adding the extra swap drive because it was already through at least half the files…

b- commented 6 years ago

It hasn’t started indexing the files yet, but I’m getting a repeated:

- RAM: 988.81209564209
- RAM: 988.81209564209
- RAM: 988.81209564209
- RAM: 988.81209564209
- RAM: 988.81209564209

ArtificialOwl commented 6 years ago

Keep me updated if the index hasn't started within the next day. I tried to make it as light as possible, but I still need to retrieve a minimum of data about all your files before starting the indexing.

b- commented 6 years ago

It's been running for at least 48 hours now, and it's going, but it seems to be hanging at times going back to a bunch of RAM lines.

The thing is, I don't know what it's doing when that's happening, because PHP is using 100% of one core at that time and is either not or barely even hitting MySQL. (Or if it is, it's responding instantly).

I installed Xdebug and started running it with the profiler enabled, but the cachegrind profiling log that it generated was already about 20 GB within minutes. I then disabled the profiler, but had it send stack traces to a log, and that one still filled up my HDD image before anything else even started happening.

I wonder if perhaps my VM is actually too fast to easily catch the bottleneck, because everything else happens literally instantaneously.

I really want to find out what it's doing when this is happening. Right now as I speak I see that it actually finished for one user, but it's been running all day for the next and hasn't gotten to the files yet. Total runtime has been probably about 48-60 hours so far, and it's only gotten to two users.

I imagine that whatever the bottleneck is, it probably can be parallelized…

ArtificialOwl commented 6 years ago

How many files do you have per user?

b- commented 6 years ago

I don’t know the exact number, but it’s at least half a million (500,000) total shared between all users. I have a few SMB shares filled with hundreds of GB of Microsoft Office documents and PDFs, which are mounted via CIFS and SMB 3.0 in /etc/fstab. The crawl itself takes a long time, and that is to be expected. But I think this is likely a resource contention issue, which should be able to be mitigated one way or another. The server is more than powerful enough, and the bottleneck is in the PHP code. But I don’t know enough to troubleshoot further.
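
To get an exact count rather than an estimate, the mount points can be walked directly. The sketch below counts entries per share, using the mount points from the fstab entries at the top of this issue. How those shares map to individual Nextcloud users is not stated in the thread, so this gives a per-share number, not a per-user one. `count_files` is an illustrative name.

```python
import os

def count_files(root):
    """Count non-directory entries under root without storing their paths."""
    total = 0
    for _dirpath, _dirnames, filenames in os.walk(root):
        total += len(filenames)
    return total

# Usage with the mount points from /etc/fstab above:
# for share in ("/NetShares/v", "/NetShares/r", "/NetShares/j", "/NetShares/w"):
#     print(share, count_files(share))
```

On a CIFS mount with hundreds of thousands of files this walk will itself take a while, which gives a rough baseline for how long pure enumeration costs outside of PHP.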

b- commented 6 years ago

I’m sure Nextcloud FullTextSearch has not been tested with a data set this large. By any means. It’s just a LOT of data. But I hope to be able to help solve this, because I imagine Nextcloud can scale pretty well otherwise (and I know Elasticsearch is quite literally designed to.)

ArtificialOwl commented 6 years ago

The size of the files is not important, but the number of files might be the issue.

The first step for each user is to get the full list of all their files; the app then picks a chunk (20 files) from that list, gets their content, indexes the content, and drops the chunk.

I will generate an instance with that many files and see if there is some way to improve this.
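
The flow described in this comment can be sketched roughly as follows. This is not the app's actual code (which is PHP); it only illustrates "build the full list first, then work through it 20 files at a time". The `index_chunk` callable is a hypothetical stand-in for fetching and indexing content.

```python
def index_in_chunks(all_files, index_chunk, chunk_size=20):
    """Process a fully materialised file list chunk by chunk.

    all_files   -- list of paths gathered up front (the "full list" step)
    index_chunk -- callable that fetches and indexes one chunk (hypothetical)
    Returns the number of chunks processed.
    """
    chunks = 0
    for start in range(0, len(all_files), chunk_size):
        chunk = all_files[start:start + chunk_size]
        index_chunk(chunk)  # content is only needed for the current chunk
        chunks += 1
    return chunks
```

In this shape, file content is held for one chunk at a time, but the full list of paths stays in memory for the whole run, so memory use during the listing step grows with the number of files rather than with their size.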

ArtificialOwl commented 6 years ago

Alright, I ran some tests with 1M files, and the index started flawlessly: after less than an hour retrieving the files from the user, it started pushing content into Elasticsearch. Also, PHP was not using that many resources.

Files were local; is your setup different?

b- commented 6 years ago

My files are “external,” CIFS mounted via /etc/fstab and then brought into NextCloud via the “local” external storage option.


amd-64 commented 6 years ago

Has this been resolved? Indexing grinds to a halt as the index folder size approaches 4.3 GB, with many files left to index. I am only indexing files that are 1 MB or less (to speed it up) and only 10 files at a time. I increased PHP memory to 256M.

Is it possible to run the first index manually per user or group?

Still no luck. Any advice?

ArtificialOwl commented 6 years ago

@amd-64 can you describe the setup of your filesystem ?

amd-64 commented 6 years ago

@daita

I use the Ubuntu VM. There are about 20 users and 50,000 files, mostly PDFs, 50 GB in total. All files are local, and more than half of the files are shared by the admin user to all others, so they are only indexed once.

The VM disk filesystem is ext4. The VM has 8 GB RAM, and swap is turned off for indexing. The VM is dual-core, but only one core is used for indexing. The Linux kernel is 4.8.0-49-generic x86_64. I run Elasticsearch 6.2.4 and fulltextsearch 0.7.2; fulltextsearch-bookmarks and the fulltextsearch-tesseract OCR app are not installed. A Redis server is running.

$ java -version
openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

After indexing the first 2-3 users, the process slows down to one user every 30 minutes to an hour, even if they have only 10 files.

I managed to make it finish (partially) after reducing the maximum file size to 1 MB and reducing the chunk size to 10, both from the admin panel. It took 10 hours to complete the first index, and the index folder size is now 3.5 GB, so it didn't hit the 4.3 GB wall as it did previously.

I can provide any of the config files if you need them.

philippe-opendsi commented 6 years ago

@amd-64 @b- I have the same problem on a large CIFS server.

b- commented 5 years ago

The only reason I stopped commenting on this is because the project I was trying to solve with NextCloud/Elastic stalled. But it definitely never was working right for me.

Now I don’t know too much about modern PHP — my PHP knowledge is mostly from the PHP4 days, and I don’t think this would have even been possible then — but the issue as far as I can see it is threading. You tell it to ingest a bunch of data, and one by one it takes each file, analyzes it, and stores the analysis in the index.

The problem is that these days files are large and plentiful, and CPUs and interconnects are very wide! That means that using one core to read and ingest one file at a time becomes unusably slow if you have a lot of files, NO MATTER HOW FAST YOUR SERVER (and its connection to those files) IS.

I don’t have an actual solution (let alone a pull request, jeez), but as far as I can tell the solution should be to split indexing up into worker threads to handle both the enumeration of the files and the ingestion of them. What I have in mind is at least one thread crawling the FS, and then that thread goes and spawns new threads for every indexable file but keeps going. You put a limit on how many cores may be used for this, and then crawling isn’t blocked by the ingestion, and it’ll scale across as wide a processor array as you’ve got!

Chances are, it nearly always takes longer to index a file than it does to enumerate it and decide to index it (during the crawl). Thus, what we should do is have crawling simply generate a queue of files to be indexed, and spawn ingestion threads based on the number of available threads. That number is itself based on both the current state of the indexing operation (are we ingesting four cores’ worth of data right now?) and the CPU limit. If we are ingesting four cores’ worth of data, there are, e.g., 6 files in the to-be-ingested queue, and the crawler uses one thread, you can safely spawn something like one more ingestion thread while the rest are going.

The idea is that given a gigantic dataset crawling should be able to complete literally days before the rest of the indexing, BUT the indexing should be able to scale across CPU cores.
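
To make the proposal concrete, here is a minimal sketch of a crawler feeding a fixed pool of ingestion workers through a bounded queue. It is written in Python purely for illustration and is not how fulltextsearch works; `crawl` and `ingest` are hypothetical stand-ins for the filesystem walk and the per-file indexing step, and the sketch has no error handling.

```python
import queue
import threading

def run_pipeline(crawl, ingest, workers=4, max_pending=100):
    """Feed `workers` ingestion threads from a crawler in the calling thread.

    crawl  -- iterable yielding file paths (hypothetical crawler)
    ingest -- callable applied to each path (hypothetical indexer)
    The bounded queue makes the crawler wait when workers fall behind.
    Returns the list of ingest() results in completion order.
    """
    pending = queue.Queue(maxsize=max_pending)
    results = []
    lock = threading.Lock()
    stop = object()  # sentinel telling a worker to exit

    def worker():
        while True:
            item = pending.get()
            if item is stop:
                return
            out = ingest(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for path in crawl:       # crawling continues while workers ingest
        pending.put(path)    # blocks only while the queue is full
    for _ in threads:
        pending.put(stop)    # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Two caveats on the design: in CPython, threads only help when `ingest` spends its time waiting on I/O (for example, reading from a network share or waiting on Elasticsearch), so CPU-bound extraction would need processes instead; and whether PHP, as run by `occ`, can do something equivalent is a separate question this sketch does not answer.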

To be clear, when I was doing this, JUST to rule out the speed of my server as a bottleneck, I gave the NextCloud host VM something on the order of 64 GB of RAM and an entire 2.66 GHz 6-core Xeon, and it didn’t index any faster than when I gave it three (non-dedicated) cores and 8 GB of RAM. I looked at htop while the index was going, and one core was completely saturated by a php process while the rest were nearly idling...

Edit: also, unless I’m mistaken (and I very well may be), mounting via CIFS and fstab means you’re delegating all the actual file management to kernel modules as if it were a local filesystem. Of course there can be inefficiencies, but kernel CIFS support is mature enough that I don’t think the fact that these files are mounted with CIFS is even relevant. Coincidental, sure, but most enterprise environments I’ve dealt with that don’t store all their data in the cloud have Windows file servers anyway. And again, this shouldn’t even be relevant as it has nothing to do with NextCloud.

b- commented 5 years ago

Another thing is that since RAM usage dramatically increases before anything even gets logged, it seems to me that it’s trying its absolute hardest to do it all at once in RAM. It doesn’t even get anywhere near the total RAM in the system, but it shouldn’t really need to use a lot of RAM to store a list of files to ingest, either. If you approach a limit, back off and slow down until the other threads can catch up :)
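
As an illustration of enumerating without holding the whole list in RAM, the sketch below yields paths lazily and groups them into chunks on the fly. It is a generic pattern, not a description of the app: the maintainer notes earlier in the thread that some data about all files is needed before indexing starts, so it is an open assumption whether this would apply there. `iter_files` and `iter_chunks` are illustrative names.

```python
import itertools
import os

def iter_files(root):
    """Lazily yield file paths under root, one at a time."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def iter_chunks(paths, chunk_size=20):
    """Group any iterable of paths into lists of at most chunk_size items."""
    it = iter(paths)
    while True:
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# Usage: only one directory listing and one chunk are in memory at a time.
# for chunk in iter_chunks(iter_files("/NetShares/v"), chunk_size=20):
#     print(len(chunk), "files ready to index")
```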

ArtificialOwl commented 5 years ago

@b- NC16 should bring some improvements for big setups like yours, let's wait and see!

nextcloud / fulltextsearch

fulltextsearch:index hangs, repeating "- RAM: 566.something" #311