rob-p opened this issue 1 year ago
Hi @rob-p and sorry for the late reply to this.
I'm interested in distributing SSHash via Bioconda (but I first have to learn how to do it). If distribution currently fails because of some compiler flags, a reasonable thing to try is to remove them and see if compilation goes well on bioconda.
Do you know if there is a better way of checking, or do we have to proceed by trial and error?
Hi @jermp,
No worries — things are busy on this end as well ;P.
So, the issues that I can see are the following (btw, the compilation of pisces-cpp that is done when distributing via bioconda currently works; it's just that the resulting executable may use instructions that are not available on all client machines):
1) `-march=native` refers to the machine that the CI is running on, so using `-march=native` in a bioconda build gives up strict control over what instructions will be used.
2) The `-mbmi2` flag uses some instructions that are likely either unavailable or untranslatable on M1/M2 hardware.
Both of these have the following implications. Bioconda builds will likely work on most client machines, but would fail on machines that lack any of the instructions pulled in by `-march=native`, or on machines not supporting BMI2 instructions. Further, a bioconda build built with these flags won't run on an M1/M2 machine, since the current bioconda CI infrastructure is x86-64 and, while Rosetta 2 can translate most intrinsics, I think it perhaps can't translate the BMI2 instructions.
If you just want the bioconda build to work on x86-64, it will probably already work on most machines, but we might want to explicitly list out the useful instructions and remove `-march=native` when building via bioconda (which is easy because you can pass arbitrary variables in the call to `cmake`). If, in addition, you'd want the current bioconda build to run on M1/M2 under Rosetta 2, you'd have to get rid of any instruction that Rosetta 2 can't translate. Of course, since M1/M2 amounts to only a few platforms to support, you could instead just provide your own pre-compiled binaries for those until bioconda gets Apple silicon CI instances.
Alright, let's dig into `-march=native` then! I'll do some research and update this thread.
Ok, my experiences over the last 2 weeks have been helpful here. I think we can just gate `-march=native` on the `CONDA_BUILD` environment variable being set. It seems conda assumes a baseline of Haswell, so we can otherwise rely on that (though we may want to keep BMI2, as I am not sure exactly where that was introduced).
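A minimal sketch of that gating in CMake (assuming the flags are currently added unconditionally; `CONDA_BUILD` is the environment variable that conda-build exports during a recipe build):

```cmake
# Hypothetical CMakeLists.txt fragment: skip -march=native when running
# inside a conda-build environment (conda-build sets CONDA_BUILD), but
# keep the explicit instruction-set flags that the conda baseline supports.
if(DEFINED ENV{CONDA_BUILD})
  message(STATUS "CONDA_BUILD detected: dropping -march=native")
  add_compile_options(-mbmi2 -msse4.2)
else()
  add_compile_options(-march=native)
endif()
```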
Following from here: https://wiki.gentoo.org/wiki/Safe_CFLAGS#Find_CPU-specific_options, by doing
giulio@xor:~$ gcc -v -E -x c /dev/null -o /dev/null -march=native 2>&1 | grep /cc1 | grep mtune
I get
/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu /dev/null -o /dev/null -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -msgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=16384 -mtune=skylake -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -dumpdir /dev/ -dumpbase null
on a server, from which we see that the detected arch is `skylake` there. But I'm still not sure which optimizations that brings in.
lol, there are a lot there! So it looks like it does explicitly pull in all of the relevant SSEs up to 4.2 (which I've read before is actually necessary; i.e., telling the compiler SSE4.2 doesn't imply it will also use 4.1 and earlier intrinsics). It also has the BMI and BMI2 instructions, popcount, MMX, and AVX/AVX2 (which we probably don't want to require?). There's also `aes`, which may be important depending on which hashing functions are being used, though I am not sure of hardware support for that more broadly (I think it's on all modern x86 chips). Many of these are `-mno-X` (so turning off support), but the others that are included I am not familiar with.
I think all that is required in the end can be understood from here: https://github.com/jermp/pthash/blob/master/include/encoders/util.hpp -- PTHash uses two special instructions: popcount and parallel-bit-deposit (or `pdep`). For popcount: `_mm_popcnt_u64` requires SSE4 (and `#include <immintrin.h>`), but we can just use `__builtin_popcountll` on gcc. For `pdep`, we actually need BMI2 for best performance, but the code can also run if that flag is not specified.
SSHash by itself does not introduce any further special instructions.
It would also be instructive to compare the performance of both tools, PTHash and SSHash, with and without those compiler flags to see how much they matter. I did this in the past for other libraries and I can confirm that both `__builtin_popcount` and `pdep` were much better than other approaches for rank and select (see also the benchmarks in this paper: https://doi.org/10.1016/j.is.2021.101756).
So, I think `pdep` may be the only critical one there. For `popcnt` we could try to rely on SIMDe (but, honestly, I don't think there are any machines lacking SSE4 that we would want to bother running on; Apple silicon can emulate this with ARM intrinsics, I believe).
For `pdep`: it looks like Intel introduced BMI and BMI2 support at the same time (Haswell and later), while on the AMD side Excavator and newer support BMI2. At this point we are talking about decade-old hardware, so I don't think we should worry about requiring it on the x86 side. I don't know whether Apple silicon will translate those instructions, or what the effect of trying to compile with `-mbmi2` on those machines is. I have to imagine there is an ARM/NEON equivalent; perhaps we could just specialize those specific parts of the code to make sure there is always a `pdep`-enabled hot path.
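One way to keep a `pdep`-enabled hot path without baking `-mbmi2` into the whole build is function-level runtime dispatch; here is a sketch using GCC/Clang's `target` attribute and `__builtin_cpu_supports` (the function names are hypothetical, not from SSHash/PTHash):

```cpp
#include <cstdint>

#if defined(__x86_64__)
#include <immintrin.h>

// Only this function is compiled with BMI2 enabled, so the rest of the
// binary stays runnable on CPUs without BMI2.
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t x, uint64_t mask) {
    return _pdep_u64(x, mask);
}
#endif

// Portable bit-by-bit fallback (also the path ARM/Apple silicon would take).
static uint64_t pdep_portable(uint64_t x, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (x & bit) result |= mask & -mask;  // lowest set bit of mask
        mask &= mask - 1;
    }
    return result;
}

uint64_t pdep_dispatch(uint64_t x, uint64_t mask) {
#if defined(__x86_64__)
    // CPU feature probe happens once; afterwards each call is a branch.
    static const bool has_bmi2 = __builtin_cpu_supports("bmi2");
    if (has_bmi2) return pdep_bmi2(x, mask);
#endif
    return pdep_portable(x, mask);
}
```

The per-call branch is predictable and cheap, but for a hot inner loop one could instead select a function pointer once at startup (or use GCC's `target_clones`/ifunc machinery) to avoid it entirely.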
Ok, so it looks like we have reduced the problem to
Will do it soon.
Hi @rob-p,
a small update on this matter: a PTHash benchmark with and without the options `-march=native -mbmi2 -msse4.2`.
# Without -march=native -mbmi2 -msse4.2
{"n": "10000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.889000", "searching_seconds": "1.358000", "encoding_seconds": "0.007000", "total_seconds": "2.254000", "pt_bits_per_key": "3.341574", "mapper_bits_per_key": "0.092091", "bits_per_key": "3.433666", "nanosec_per_key": "12.537637"}
{"n": "10000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.890000", "searching_seconds": "0.543000", "encoding_seconds": "0.091000", "total_seconds": "1.524000", "pt_bits_per_key": "3.642861", "mapper_bits_per_key": "0.735875", "bits_per_key": "4.378736", "nanosec_per_key": "21.026188"}
{"n": "10000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.891000", "searching_seconds": "1.952000", "encoding_seconds": "0.014000", "total_seconds": "2.857000", "pt_bits_per_key": "2.252707", "mapper_bits_per_key": "0.092091", "bits_per_key": "2.344798", "nanosec_per_key": "47.303648"}
{"n": "10000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.888000", "searching_seconds": "0.996000", "encoding_seconds": "0.055000", "total_seconds": "1.939000", "pt_bits_per_key": "2.921088", "mapper_bits_per_key": "0.416306", "bits_per_key": "3.337394", "nanosec_per_key": "17.105718"}
# With -march=native -mbmi2 -msse4.2
{"n": "10000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.914000", "searching_seconds": "1.346000", "encoding_seconds": "0.005000", "total_seconds": "2.265000", "pt_bits_per_key": "3.341574", "mapper_bits_per_key": "0.092091", "bits_per_key": "3.433666", "nanosec_per_key": "11.641750"}
{"n": "10000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.923000", "searching_seconds": "0.539000", "encoding_seconds": "0.088000", "total_seconds": "1.550000", "pt_bits_per_key": "3.642861", "mapper_bits_per_key": "0.735875", "bits_per_key": "4.378736", "nanosec_per_key": "17.538620"}
{"n": "10000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.937000", "searching_seconds": "1.946000", "encoding_seconds": "0.011000", "total_seconds": "2.894000", "pt_bits_per_key": "2.252707", "mapper_bits_per_key": "0.092091", "bits_per_key": "2.344798", "nanosec_per_key": "27.216862"}
{"n": "10000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.913000", "searching_seconds": "1.006000", "encoding_seconds": "0.054000", "total_seconds": "1.973000", "pt_bits_per_key": "2.921088", "mapper_bits_per_key": "0.416306", "bits_per_key": "3.337394", "nanosec_per_key": "15.361106"}
In summary: the metric of interest here is `nanosec_per_key`, i.e., the average lookup time in nanoseconds per key. It does not change much except for Elias-Fano, as I expected, because its code is the one relying on PDEP for faster selection in a 64-bit word. With the compiler options, Elias-Fano is almost 2X faster. So I would expect to see a similar effect for SSHash as well, because it uses Elias-Fano in a couple of places (but hopefully not a ~2X slowdown...).
These are the results for 100M keys:
# With -march=native -mbmi2 -msse4.2
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.179000", "searching_seconds": "21.196000", "encoding_seconds": "0.053000", "total_seconds": "31.428000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "27.633028"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.176000", "searching_seconds": "6.951000", "encoding_seconds": "0.939000", "total_seconds": "18.066000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "44.560108"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.213000", "searching_seconds": "35.437000", "encoding_seconds": "0.106000", "total_seconds": "45.756000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "45.438587"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.192000", "searching_seconds": "15.508000", "encoding_seconds": "0.610000", "total_seconds": "26.310000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.359295"}
# Without -march=native -mbmi2 -msse4.2
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.695000", "searching_seconds": "21.136000", "encoding_seconds": "0.066000", "total_seconds": "30.897000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.539648"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.684000", "searching_seconds": "6.855000", "encoding_seconds": "0.963000", "total_seconds": "17.502000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "51.200598"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.746000", "searching_seconds": "35.093000", "encoding_seconds": "0.130000", "total_seconds": "44.969000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "79.023305"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.721000", "searching_seconds": "15.411000", "encoding_seconds": "0.627000", "total_seconds": "25.759000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "40.438007"}
and are consistent with the others reported before. Elias-Fano went from ~45 ns/key with the flags to ~80 ns/key without.
Thanks @jermp!
I wonder if there is a `simde`-like library for the `pdep` instruction. That is, would it be possible to compile without the flag, but have a library that provides runtime dispatch, using this instruction if available and a fallback otherwise?
I also made the following experiment:
# With -mbmi2 -msse4.2 but without -march=native
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.772000", "searching_seconds": "20.985000", "encoding_seconds": "0.063000", "total_seconds": "30.820000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.117325"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.787000", "searching_seconds": "6.809000", "encoding_seconds": "0.949000", "total_seconds": "17.545000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "44.683312"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.833000", "searching_seconds": "34.743000", "encoding_seconds": "0.107000", "total_seconds": "44.683000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "46.548534"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.770000", "searching_seconds": "14.854000", "encoding_seconds": "0.620000", "total_seconds": "25.244000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.459911"}
# Without -mbmi2 -msse4.2 but with -march=native
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.181000", "searching_seconds": "21.743000", "encoding_seconds": "0.053000", "total_seconds": "31.977000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.107325"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.195000", "searching_seconds": "7.072000", "encoding_seconds": "0.940000", "total_seconds": "18.207000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "45.230406"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.241000", "searching_seconds": "35.922000", "encoding_seconds": "0.106000", "total_seconds": "46.269000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "45.972021"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.204000", "searching_seconds": "15.220000", "encoding_seconds": "0.611000", "total_seconds": "26.035000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.728889"}
from which we get the very same performance. This is consistent with the `gcc -march=native` output shown earlier, which also includes `-mbmi2`.
> That is, would it be possible to compile without the flag, but to have a library that provides runtime dispatch using this instruction if available but not otherwise?
Would it be possible to detect it via CMake?
Yes. One can have CMake compile arbitrary code and see if it runs; there may even be a CMake module to check for this. For bioconda, though, the fear is that the host used to compile has the instruction but the runtime machine doesn't. In that case, one can compile both paths and dispatch to the correct one at runtime. I do this with ksw2 in salmon.
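For the compile-and-run check, CMake does ship a module for this; a sketch (the result variable name `SSHASH_HAVE_BMI2` is hypothetical):

```cmake
include(CheckCXXSourceRuns)
set(CMAKE_REQUIRED_FLAGS "-mbmi2")
check_cxx_source_runs("
  #include <immintrin.h>
  int main() { return _pdep_u64(1ULL, 1ULL) == 1ULL ? 0 : 1; }
" SSHASH_HAVE_BMI2)
# Caveat from above: this only proves the *build* host supports BMI2,
# not the machine a bioconda package eventually runs on.
if(SSHASH_HAVE_BMI2)
  add_compile_options(-mbmi2)
endif()
```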
Continuing from #22
The discussion here is in regard to point 2. I have two thoughts here.
1) I absolutely agree that, until native M1/M2 builds are available from bioconda, it would be better for folks to compile it themselves, and for that to be made as easy as possible (could we provide pre-compiled binaries?).
2) Regarding performance: actually, Rosetta 2 is pretty amazing in my experience. Even through translation, the M1 (Pro/Max) often outperforms the previous top-end MacBook Pros running an i9. My understanding is that Rosetta 2 directly translates many of the x86 intrinsics to native NEON intrinsics (or whatever special instructions the M architecture has). While I agree that compilation isn't difficult, I also have a lot of prior experience telling me that my saying so is very different from the experience a biologist who doesn't focus on software/methods will have trying to build my tool.
Of course, I absolutely understand if you think supporting bioconda builds that run on M1/M2 isn't of sufficient priority to warrant effort at this point; we could ask the bioconda people what their path forward and intended timeline is. On the other hand, it would be nice to know the delta between what `-march=native` offers and what instructions are actually useful or necessary. It may be that we can remove that flag in conda builds, explicitly specify the instructions we want, and get little-to-no performance degradation plus the ability to distribute something via conda that works on all platforms (which makes it trivial for people to use both locally and on a cluster).