Open PaulMullowney opened 4 years ago

Hi, I'm trying to optimize the performance of my matrix assembly in Hypre and I'm finding that almost all of the time is being spent in HYPRE_IJMatrixAssemble.

First, on a given rank I call HYPRE_IJMatrixSetValues() once on the owned part (rows) of the matrix. The input to this routine is perfectly row and column sorted and there are no repeated entries, i.e. I give it a CSR matrix.

Second, on that same rank, I call HYPRE_IJMatrixAddToValues() once on the part that needs to be accumulated into the owned rows of other ranks. This input is likewise perfectly row and column sorted with no repeated entries, i.e. again a CSR matrix.

Then, I call HYPRE_IJMatrixAssemble.

I'm finding that this final Assemble call consumes 80-90% of the run time and I'm wondering if anything can be done. For instance, say I'm on process i of n ranks: would splitting HYPRE_IJMatrixAddToValues() into n-1 calls help, i.e.

- HYPRE_IJMatrixAddToValues() on the values to add to process 0
- ...
- HYPRE_IJMatrixAddToValues() on the values to add to process i-1
- HYPRE_IJMatrixAddToValues() on the values to add to process i+1
- ...
- HYPRE_IJMatrixAddToValues() on the values to add to process n-1

Here, each input would be row/column sorted with no repeats.

One last point: ultimately I want this on device, so is there a good place to add a cuSPARSE matrix-matrix add to complete the assembly process? Perhaps this could give some performance benefit.

Thanks, -Paul
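[Editor's note] For concreteness, here is a minimal sketch of the call sequence described above. The communicator, index bounds, and arrays are placeholders for the application's own data, not code from this thread:

```c
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_IJ_mv.h"

/* Sketch: assemble an IJ matrix from pre-sorted, CSR-like input.
 * With a CUDA build of hypre the value arrays may live in device memory. */
void assemble_sketch(MPI_Comm comm, HYPRE_BigInt ilower, HYPRE_BigInt iupper,
                     HYPRE_Int n_owned, HYPRE_Int *ncols_owned,
                     HYPRE_BigInt *rows_owned, HYPRE_BigInt *cols_owned,
                     HYPRE_Complex *vals_owned,
                     HYPRE_Int n_off, HYPRE_Int *ncols_off,
                     HYPRE_BigInt *rows_off, HYPRE_BigInt *cols_off,
                     HYPRE_Complex *vals_off,
                     HYPRE_IJMatrix *A_out)
{
   HYPRE_IJMatrix A;
   HYPRE_IJMatrixCreate(comm, ilower, iupper, ilower, iupper, &A);
   HYPRE_IJMatrixSetObjectType(A, HYPRE_PARCSR);
   HYPRE_IJMatrixInitialize(A);

   /* 1) one call for the owned rows: sorted, no duplicates */
   HYPRE_IJMatrixSetValues(A, n_owned, ncols_owned, rows_owned,
                           cols_owned, vals_owned);
   /* 2) one call for contributions to rows owned by other ranks */
   HYPRE_IJMatrixAddToValues(A, n_off, ncols_off, rows_off,
                             cols_off, vals_off);
   /* 3) MPI exchange and compression both happen here */
   HYPRE_IJMatrixAssemble(A);
   *A_out = A;
}
```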
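[Editor's note] On the device-side add: the cuSPARSE matrix-matrix add mentioned above is presumably csrgeam2, which computes C = alpha*A + beta*B for CSR matrices. A hedged sketch of its standard three-step sequence follows (all names are placeholders; this is not hypre code, and error checking is omitted):

```c
#include <cuda_runtime.h>
#include <cusparse.h>

/* Sketch: C = A + B for two device-resident CSR matrices via csrgeam2. */
void csr_add_sketch(cusparseHandle_t h, int m, int n,
                    int nnzA, const double *valA, const int *rowA, const int *colA,
                    int nnzB, const double *valB, const int *rowB, const int *colB)
{
   const double alpha = 1.0, beta = 1.0;
   cusparseMatDescr_t dA, dB, dC;
   cusparseCreateMatDescr(&dA);
   cusparseCreateMatDescr(&dB);
   cusparseCreateMatDescr(&dC);

   int *rowC, *colC, nnzC;
   double *valC;
   cudaMalloc((void **) &rowC, sizeof(int) * (m + 1));

   /* step 1: query workspace size (C's val/col arrays may be NULL here) */
   size_t bufSize;
   cusparseDcsrgeam2_bufferSizeExt(h, m, n, &alpha, dA, nnzA, valA, rowA, colA,
                                   &beta, dB, nnzB, valB, rowB, colB,
                                   dC, NULL, rowC, NULL, &bufSize);
   void *buf;
   cudaMalloc(&buf, bufSize);

   /* step 2: compute C's row pointer and nonzero count */
   cusparseXcsrgeam2Nnz(h, m, n, dA, nnzA, rowA, colA,
                        dB, nnzB, rowB, colB, dC, rowC, &nnzC, buf);

   /* step 3: allocate and fill C */
   cudaMalloc((void **) &colC, sizeof(int) * nnzC);
   cudaMalloc((void **) &valC, sizeof(double) * nnzC);
   cusparseDcsrgeam2(h, m, n, &alpha, dA, nnzA, valA, rowA, colA,
                     &beta, dB, nnzB, valB, rowB, colB,
                     dC, valC, rowC, colC, buf);
   /* ... use C, then cudaFree the buffers and destroy the descriptors ... */
}
```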
Hi @PaulMullowney, your observation that HYPRE_IJMatrixAssemble takes most of the time makes sense to me, since HYPRE_IJMatrixSetValues/AddToValues just copy the input matrix entries to internal memory, on-process or off-process. All of the computation (compression) is done at the assemble stage, where off-process entries are first transferred via MPI and combined with the on-process entries, and then all the entries are compressed to make the CSR matrices. I don't think the splitting you mentioned is a good idea, since it would require a lot of cudaMemcpy calls without any benefit. These details are described in the paper: https://github.com/hypre-space/hypre/wiki/pubs/porting-hypre-2020.pdf
Do you expect either HYPRE_IJMatrixSetRowSizes or HYPRE_IJMatrixSetDiagOffdSizes to make a difference for GPU assembly? -Paul
On the CPU, the idea behind HYPRE_IJMatrixSetDiagOffdSizes was to set or add values directly into the ParCSR matrix data structure and avoid using an auxiliary data structure. I leave it up to Ruipeng to respond to the GPU question. Ulrike
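[Editor's note] For illustration, a minimal sketch of the CPU-path setup just described; the index bounds and per-row count arrays are placeholders the application must supply:

```c
#include <mpi.h>
#include "HYPRE.h"
#include "HYPRE_IJ_mv.h"

/* Sketch: declare exact per-row counts for the diagonal and off-diagonal
 * blocks so Set/AddToValues can write straight into the ParCSR structure. */
void create_with_sizes(MPI_Comm comm, HYPRE_BigInt ilower, HYPRE_BigInt iupper,
                       const HYPRE_Int *diag_sizes, const HYPRE_Int *offd_sizes,
                       HYPRE_IJMatrix *A_out)
{
   HYPRE_IJMatrixCreate(comm, ilower, iupper, ilower, iupper, A_out);
   HYPRE_IJMatrixSetObjectType(*A_out, HYPRE_PARCSR);
   /* must be called before Initialize to take effect */
   HYPRE_IJMatrixSetDiagOffdSizes(*A_out, diag_sizes, offd_sizes);
   HYPRE_IJMatrixInitialize(*A_out);
}
```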
Basically, neither HYPRE_IJMatrixSetRowSizes nor HYPRE_IJMatrixSetDiagOffdSizes has any effect on GPU assembly; they are used by the CPU algorithm, as Ulrike explained. GPU assembly always uses an auxiliary matrix.

On the other hand, HYPRE_IJMatrixSetMaxOffProcElmts and hypre_IJMatrixSetMaxOnProcElmtsParCSR (I just realized we don't have a HYPRE_ interface for it) may have some effect, though not much is expected: they try to pre-allocate adequate space in the auxiliary matrix. If neither is set, the algorithm starts with some initial allocation and reallocates by a growth factor whenever it runs out of space. But as you can see, this should not affect the performance too much.
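[Editor's note] A hedged fragment showing where such a hint would go, assuming an already-created HYPRE_IJMatrix A and a count the application can estimate:

```c
/* Hint to pre-size the auxiliary matrix; set before values are passed in.
 * Without it, hypre starts from a default allocation and regrows on demand. */
HYPRE_IJMatrixSetMaxOffProcElmts(A, max_offproc_nnz);  /* public API */
/* hypre_IJMatrixSetMaxOnProcElmtsParCSR(...) is the internal on-process
 * counterpart; as noted above it currently has no HYPRE_ wrapper. */
```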
@PaulMullowney we currently have only one approach to assembling on the GPU, which tries to address the most generic situation, so it may not be the most efficient for you, since it seems you pretty much have CSR matrices already. We can figure out whether we can take a shortcut, or even bypass the IJ interface and build the ParCSR matrix directly.
Thanks!
I'm gonna take a crack at this. I've been reading the assembly code in IJMatrix_parcsr_device.c and I have a pretty good idea of what's happening. I'm going to create a couple of new methods in that file and provide a simplified version for my use case.

If successful, we can discuss what's needed to get it into master.

-Paul
Sounds good. Thank you @PaulMullowney
I'm curious whether you had any luck with this approach. I'm in a very similar situation myself, where each MPI process already has a full CSR matrix in GPU device memory (stored either as a single CSR matrix or as two separate CSR matrices for the diagonal and off-diagonal parts). I want to hand the data to hypre, ideally without any unnecessary copying or processing.

Did you find a way to do this, @PaulMullowney?
I have a working version of this code in a fork/branch. It's pretty much up to date with Hypre master. I would read my SC21 paper for the details of the assembly. The key points are as follows:

1. I have the data in a COO matrix for all values that are owned by a particular rank (i.e. the row id is in the appropriate range for that rank).
2. I have the data in a COO matrix for all values that need to be shared with other ranks (i.e. the row id is outside the range for that rank).

The data from 2 is stacked on top of 1, as in the sketch below. There are a couple of subtle points and a few extra API calls. I'm happy to share if it helps.
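[Editor's note] A sketch of the stacked COO layout just described, assuming one contiguous allocation per array (counts and names are placeholders):

```c
/* One contiguous COO buffer per array; the shared entries (point 2)
 * are stacked directly on top of the owned entries (point 1). */
HYPRE_Int      nnz_owned;   /* entries whose row id this rank owns       */
HYPRE_Int      nnz_shared;  /* entries destined for rows on other ranks  */
HYPRE_BigInt  *rows;        /* length >= nnz_owned + nnz_shared          */
HYPRE_BigInt  *cols;
HYPRE_Complex *vals;
/* [0, nnz_owned)                      : owned entries (point 1)             */
/* [nnz_owned, nnz_owned + nnz_shared) : entries shipped elsewhere (point 2) */
```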
-Paul
Thanks, Paul. I believe this must be the paper you are referring to: https://www.nrel.gov/docs/fy21osti/79645.pdf. I found it very interesting, and I would greatly appreciate anything you can share.
I suspect that my case may be simpler than yours, since I have arranged for the assembly to be a purely local operation on each MPI process. That is, there is no need to communicate between processes in my case.
Anyway, I will first try calling HYPRE_IJMatrixSetValues with pointers to the CSR data in device memory, followed by HYPRE_IJMatrixAssemble. I'm assuming that HYPRE_IJMatrixSetValues will copy the data (as @liruipeng mentioned in a comment above), but I'm hoping that HYPRE_IJMatrixAssemble will be able to return immediately without any expensive processing. In the ideal case, I would like to avoid even the copy in HYPRE_IJMatrixSetValues. Do you know if it is possible to directly access device-side pointers to the underlying CSR data structure?
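[Editor's note] On direct access: there is no public API for this, but hypre's internal headers expose accessor macros. A hedged sketch (internal names, subject to change between versions):

```c
#include "_hypre_parcsr_mv.h"  /* internal header: hypre_ParCSRMatrix, hypre_CSRMatrix */

/* After HYPRE_IJMatrixAssemble(A): */
HYPRE_ParCSRMatrix parcsr;
HYPRE_IJMatrixGetObject(A, (void **) &parcsr);

/* Diagonal (on-process) block; with a CUDA build these are device pointers. */
hypre_CSRMatrix *diag   = hypre_ParCSRMatrixDiag((hypre_ParCSRMatrix *) parcsr);
HYPRE_Int       *diag_i = hypre_CSRMatrixI(diag);    /* row pointers   */
HYPRE_Int       *diag_j = hypre_CSRMatrixJ(diag);    /* column indices */
HYPRE_Complex   *diag_a = hypre_CSRMatrixData(diag); /* values         */
/* hypre_ParCSRMatrixOffd(...) exposes the off-diagonal block the same way. */
```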
Anyway, perhaps I should open a separate issue, if needed, in case this is too off-topic.
Hi, I'm wondering if there have been any developments on this, or if you could point me to the modified fork/branch and the details of the API calls.
The code I'm working on uses a numerical method similar to the one described in @PaulMullowney's SC21 paper, which I found quite insightful. I, too, found that the majority of the time is spent in HYPRE_IJMatrixAssemble when passing a pre-assembled CSR matrix to hypre, so I think a similar use of the modified HYPRE_IJMatrixAssemble would be valuable. Is there any plan to incorporate this into the master branch?
I would greatly appreciate any help you could provide with this.
Thanks, Nick
An implementation is at:
https://github.com/PaulMullowney/hypre/tree/fast_ijassem_device
In the current implementation, you must allocate the COO buffers that are passed to Set/AddToValues as

[values to set | values to add | extra room]

where:
- values to set = the on-process values
- values to add = the values sent to other processes
- extra room = the space needed to receive all the potential values from other ranks
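[Editor's note] A hedged sketch of that allocation (counts are placeholders; hypre_TAlloc is hypre's internal allocator, used here only for illustration):

```c
/* One contiguous COO allocation per array, laid out as
 * [ values to set | values to add | extra room for received values ]. */
HYPRE_Int n_set, n_add, n_recv_max;             /* application-supplied counts */
HYPRE_Int n_total = n_set + n_add + n_recv_max;

HYPRE_BigInt  *rows = hypre_TAlloc(HYPRE_BigInt,  n_total, HYPRE_MEMORY_DEVICE);
HYPRE_BigInt  *cols = hypre_TAlloc(HYPRE_BigInt,  n_total, HYPRE_MEMORY_DEVICE);
HYPRE_Complex *vals = hypre_TAlloc(HYPRE_Complex, n_total, HYPRE_MEMORY_DEVICE);
```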
Usage of this is shown in the following lines:
https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L883 https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L940 https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1508 https://github.com/PaulMullowney/nalu-wind/blob/master/src/HypreLinearSystem.C#L1616
-Paul
Thank you very much @PaulMullowney