intel / intel-xpu-backend-for-triton

OpenAI Triton backend for Intel® GPUs
MIT License

Remove decomposition of insert_slice_async in favour of SPIR-V operations #546

Open prathams417 opened 6 months ago

prathams417 commented 6 months ago

Currently the insert_slice_async op is not actually executed asynchronously. Instead, it is decomposed as follows:

// insert_slice_async %src, %dst, %idx, %mask, %other
// =>
// %tmp = load %src, %mask, %other
// %res = insert_slice %tmp into %dst[%idx]
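For readers unfamiliar with the op, here is a rough Python model of what the decomposition computes. It is only a sketch of the semantics, not the backend's code: the function names are made up for illustration, tensors are modelled as plain lists, and %dst is assumed to be a list of slots indexed by %idx.

```python
def masked_load(src, mask, other):
    # Models `%tmp = load %src, %mask, %other`:
    # take src[i] where mask[i] is set, otherwise fall back to other[i].
    return [s if m else o for s, m, o in zip(src, mask, other)]


def insert_slice(tmp, dst, idx):
    # Models `%res = insert_slice %tmp into %dst[%idx]`:
    # return a copy of dst whose slot idx holds tmp (value semantics).
    res = [list(slot) for slot in dst]
    res[idx] = list(tmp)
    return res


def insert_slice_async_decomposed(src, dst, idx, mask, other):
    # The two steps run back to back, so nothing is asynchronous here.
    tmp = masked_load(src, mask, other)
    return insert_slice(tmp, dst, idx)


src = [1, 2, 3, 4]
mask = [True, False, True, False]
other = [0, 0, 0, 0]
dst = [[9, 9, 9, 9], [9, 9, 9, 9]]
res = insert_slice_async_decomposed(src, dst, 1, mask, other)
```

The point is that the load has to complete before the insert can start, so the copy cannot overlap with other work.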

NVIDIA has already replaced this decomposition with the PTX instruction cp.async.cg.shared.global.

We should remove the decomposition in favour of the equivalent operations that exist in SPIR-V: OpGroupAsyncCopy and OpGroupWaitEvents.
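To illustrate the issue-then-wait pattern these two operations provide, here is a small Python analogy. It is purely a conceptual sketch: group_async_copy and group_wait_events are hypothetical helpers standing in for OpGroupAsyncCopy and OpGroupWaitEvents, a Future plays the role of the event, and a background thread plays the role of the hardware copy. It does not model masks, strides, or workgroup scope.

```python
from concurrent.futures import ThreadPoolExecutor, wait

# Single worker thread standing in for the asynchronous copy engine.
_pool = ThreadPoolExecutor(max_workers=1)


def group_async_copy(dst, dst_offset, src, num_elements):
    # Hypothetical stand-in for OpGroupAsyncCopy: start the copy and
    # return an event (here a Future) without waiting for completion.
    def _copy():
        dst[dst_offset:dst_offset + num_elements] = src[:num_elements]
    return _pool.submit(_copy)


def group_wait_events(events):
    # Hypothetical stand-in for OpGroupWaitEvents: block until every
    # copy tied to the given events has finished.
    wait(events)
    for event in events:
        event.result()  # re-raise any error from the copy


global_mem = list(range(8))
local_mem = [0] * 8

event = group_async_copy(local_mem, 4, global_mem, 4)
# Independent work could run here while the copy is in flight.
group_wait_events([event])
```

The destination is only safe to read after the wait, which is the property the current load-plus-insert_slice decomposition does not expose.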

prathams417 commented 6 months ago

I have been able to generate IR for OpGroupAsyncCopy in the match-and-rewrite pattern for insert_slice_async. Issues in the SPIR-V translator are blocking further progress on this task, and a ticket has been filed with the SPIR-V team.

prathams417 commented 4 months ago

Branch with latest changes: prathams/spriv-codegen-async-insertslice

Related Jira ticket has been added to internal wiki under Internal Issues Tracking -> Compiler