ofiwg / libfabric

Open Fabric Interfaces
http://libfabric.org/

prov/`verbs;ofi_rxm`: `FI_HMEM` fails to be detected when out of cache #9759

Open thomasgillis opened 9 months ago

thomasgillis commented 9 months ago

Hi all,

I am reaching out with a question that might also be an issue in verbs;ofi_rxm: when the MR cache size (1024 entries by default) is exceeded, fi_writemsg interprets the buffer address as a system address by default, instead of a device address. Increasing the cache size with FI_MR_CACHE_MAX_COUNT solves the problem.

Is my issue a direct consequence of FI_MR_HMEM and the limit on the cache size? Or is it a missing detection of the pointer type in the provider?

Thanks for your time and your help :-)
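
The FI_MR_CACHE_MAX_COUNT workaround described above is normally applied from the shell, but it can also be set from inside the application. The following is a minimal sketch, assuming the variable is exported before any libfabric call initializes the provider; `set_mr_cache_max_count` is a hypothetical helper (not part of libfabric), and 4096 in the usage note is an illustrative value, not one taken from this report.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() under strict C standards */
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper, not a libfabric API: export FI_MR_CACHE_MAX_COUNT
 * so that it is already in the environment when libfabric later reads
 * its FI_* variables. Returns 0 on success, -1 on failure. */
static int set_mr_cache_max_count(unsigned long count)
{
    char value[32];
    int n = snprintf(value, sizeof(value), "%lu", count);

    if (n < 0 || (size_t)n >= sizeof(value))
        return -1;

    /* overwrite = 1: replace any value inherited from the shell */
    return setenv("FI_MR_CACHE_MAX_COUNT", value, 1);
}
```

Calling `set_mr_cache_max_count(4096)` at the very start of the program, before the first libfabric call, should be equivalent to running the job with `FI_MR_CACHE_MAX_COUNT=4096` in the environment. This only raises the limit; it does not address why the device pointer is misdetected once the cache is full.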

nikhilnanal commented 3 months ago

Hi @thomasgillis. I wanted to get some more information about the usage; even better would be a reproducer you could share, which would help us understand it. Does the application register a buffer, send, and then deregister the MR before the next one, or does it register all the buffers at once and then send before deregistering? Does the application deregister at all? Are there any other settings or flags set for this test? What device are you using for the test?

thomasgillis commented 1 week ago

Hi @nikhilnanal, sorry, I am just seeing this now. I haven't touched the code since April (changed jobs), but IIRC we would register 2048 messages and then execute them all. The reproducer is public, here is the link: https://github.com/pmodels/rmem