rob-p opened this issue 1 year ago
Hi @rob-p and sorry for the late reply to this.
I'm interested in distributing SSHash via Bioconda (but I first have to learn how to do it). If distribution currently fails because of some compiler flags, a reasonable thing to try is to remove them and see if compilation goes well on bioconda.
Do you know if there is a better way of checking, or do we have to proceed by trial and error?
Hi @jermp,
No worries — things are busy on this end as well ;P.
So, the issues that I can see are the following (btw, the compilation of pisces-cpp that is done when distributing via bioconda currently works; it's just that the resulting executable may use instructions that are not available on all client machines):
1) `-march=native` refers to the machine that the CI is running on, so using `-march=native` in a bioconda build gives up strict control over what instructions will be used.
2) The `-mbmi2` flag uses some instructions that are likely either unavailable or untranslatable on M1/M2 hardware.
Both of these have the following implications. Bioconda builds will likely work on most client machines, but would fail on machines that lack any of the instructions pulled in by `-march=native`, or on machines not supporting BMI2 instructions. Further, a bioconda build built with these flags won't run on an M1/M2 machine, since the current bioconda CI infrastructure is x86-64 and, while Rosetta 2 can translate most intrinsics, I think it perhaps can't translate the BMI2 instructions.
If you just want the bioconda build to work on x86-64, it will probably already work on most machines, but we might want to explicitly list out the useful instructions and remove `-march=native` when building via bioconda (which is easy because you can pass arbitrary variables in the call to `cmake`). If, in addition, you'd want the current bioconda build to run on M1/M2 under Rosetta 2, you'd have to get rid of any instruction that Rosetta 2 can't translate. Of course, since M1/M2 amounts to only a few platforms to support, you could instead just provide your own pre-compiled binaries for those until bioconda gets Apple silicon CI instances.
Alright, let's dig into `-march=native` then! I'll do some research and update this thread.
Ok, my experiences over the last 2 weeks have been helpful here. I think we can just gate `-march=native` on the `CONDA_BUILD` environment variable being set. It seems conda assumes a baseline of Haswell, so we can otherwise rely on that (though we may want to keep BMI2, as I am not sure exactly where that was introduced).
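A minimal sketch of that gating in CMake (assuming the flags are currently added unconditionally; `CONDA_BUILD` is the environment variable that conda-build exports during a recipe build):

```cmake
# Hypothetical CMakeLists.txt fragment: skip -march=native when running
# inside a conda-build environment (conda-build sets CONDA_BUILD), but
# keep the explicit instruction-set flags that the conda baseline supports.
if(DEFINED ENV{CONDA_BUILD})
  message(STATUS "CONDA_BUILD detected: dropping -march=native")
  add_compile_options(-mbmi2 -msse4.2)
else()
  add_compile_options(-march=native)
endif()
```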
Following from here: https://wiki.gentoo.org/wiki/Safe_CFLAGS#Find_CPU-specific_options, by doing
giulio@xor:~$ gcc -v -E -x c /dev/null -o /dev/null -march=native 2>&1 | grep /cc1 | grep mtune
I get
/usr/lib/gcc/x86_64-linux-gnu/11/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu /dev/null -o /dev/null -march=skylake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mno-rdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -msgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=16384 -mtune=skylake -fasynchronous-unwind-tables -fstack-protector-strong -Wformat -Wformat-security -fstack-clash-protection -fcf-protection -dumpdir /dev/ -dumpbase null
on a server, from which we see that the detected arch is `skylake` there. But I'm still not sure which optimizations that brings in.
lol, there are a lot there! So it looks like it does explicitly pull in all of the relevant SSEs up to 4.2 (which I've read before is actually necessary; i.e., telling the compiler SSE4.2 doesn't imply it will also use 4.1 and earlier intrinsics). It also has the BMI and BMI2 instructions, popcount, MMX, and AVX/AVX2 (which we probably don't want to require?). There's also `aes`, which may be important depending on which hashing functions are being used, though I am not sure of hardware support for that more broadly (I think it's on all modern x86 chips). Many of these are `-mno-X` (so turning off support), but the others that are included I am not familiar with.
I think all that is required in the end can be understood from here: https://github.com/jermp/pthash/blob/master/include/encoders/util.hpp -- PTHash uses two special instructions: popcount and parallel-bit-deposit (or `pdep`). For popcount: `_mm_popcnt_u64` requires SSE4 (and `#include <immintrin.h>`), but we can just use `__builtin_popcountll` on gcc. For `pdep`, we actually need BMI2 for best performance, but the code can also run if that flag is not specified.
SSHash by itself does not introduce any further special instructions.
It would also be instructive to compare the performance of both tools, PTHash and SSHash, with and without those compiler flags to see how much they matter. I did this in the past for other libraries and I can confirm that both `__builtin_popcount` and `pdep` were much better than other approaches for rank and select (see also the benchmarks in this paper: https://doi.org/10.1016/j.is.2021.101756).
So, I think `pdep` may be the only critical one there. For `popcnt` we could try to rely on SIMDe (but, honestly, I don't think there are any machines lacking SSE4 that we would want to bother running on; Apple silicon can emulate this with ARM intrinsics, I believe).
For `pdep`: it looks like Intel introduced BMI and BMI2 support at the same time (Haswell and later), while on the AMD side Excavator and newer support BMI2. At this point we are talking about decade-old hardware, so I don't think we should worry about requiring it on the x86 side. I don't know whether Apple silicon will translate those instructions, or what the effect of trying to compile with `-mbmi2` on those machines is. I have to imagine there is an ARM/NEON equivalent; perhaps we could just specialize those specific parts of the code to make sure there is always a `pdep`-enabled hot path.
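One way to keep a `pdep`-enabled hot path without baking `-mbmi2` into the whole build is function-level runtime dispatch; here is a sketch using GCC/Clang's `target` attribute and `__builtin_cpu_supports` (the function names are hypothetical, not from SSHash/PTHash):

```cpp
#include <cstdint>

#if defined(__x86_64__)
#include <immintrin.h>

// Only this function is compiled with BMI2 enabled, so the rest of the
// binary stays runnable on CPUs without BMI2.
__attribute__((target("bmi2")))
static uint64_t pdep_bmi2(uint64_t x, uint64_t mask) {
    return _pdep_u64(x, mask);
}
#endif

// Portable bit-by-bit fallback (also the path ARM/Apple silicon would take).
static uint64_t pdep_portable(uint64_t x, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (x & bit) result |= mask & -mask;  // lowest set bit of mask
        mask &= mask - 1;
    }
    return result;
}

uint64_t pdep_dispatch(uint64_t x, uint64_t mask) {
#if defined(__x86_64__)
    // CPU feature probe happens once; afterwards each call is a branch.
    static const bool has_bmi2 = __builtin_cpu_supports("bmi2");
    if (has_bmi2) return pdep_bmi2(x, mask);
#endif
    return pdep_portable(x, mask);
}
```

The per-call branch is predictable and cheap, but for a hot inner loop one could instead select a function pointer once at startup (or use GCC's `target_clones`/ifunc machinery) to avoid it entirely.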
Ok, so it looks like we have reduced the problem to
Will do it soon.
Hi @rob-p,
a small update on this matter: a PTHash benchmark with and without the options `-march=native -mbmi2 -msse4.2`.
# Without -march=native -mbmi2 -msse4.2
{"n": "10000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.889000", "searching_seconds": "1.358000", "encoding_seconds": "0.007000", "total_seconds": "2.254000", "pt_bits_per_key": "3.341574", "mapper_bits_per_key": "0.092091", "bits_per_key": "3.433666", "nanosec_per_key": "12.537637"}
{"n": "10000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.890000", "searching_seconds": "0.543000", "encoding_seconds": "0.091000", "total_seconds": "1.524000", "pt_bits_per_key": "3.642861", "mapper_bits_per_key": "0.735875", "bits_per_key": "4.378736", "nanosec_per_key": "21.026188"}
{"n": "10000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.891000", "searching_seconds": "1.952000", "encoding_seconds": "0.014000", "total_seconds": "2.857000", "pt_bits_per_key": "2.252707", "mapper_bits_per_key": "0.092091", "bits_per_key": "2.344798", "nanosec_per_key": "47.303648"}
{"n": "10000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.888000", "searching_seconds": "0.996000", "encoding_seconds": "0.055000", "total_seconds": "1.939000", "pt_bits_per_key": "2.921088", "mapper_bits_per_key": "0.416306", "bits_per_key": "3.337394", "nanosec_per_key": "17.105718"}
# With -march=native -mbmi2 -msse4.2
{"n": "10000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.914000", "searching_seconds": "1.346000", "encoding_seconds": "0.005000", "total_seconds": "2.265000", "pt_bits_per_key": "3.341574", "mapper_bits_per_key": "0.092091", "bits_per_key": "3.433666", "nanosec_per_key": "11.641750"}
{"n": "10000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.923000", "searching_seconds": "0.539000", "encoding_seconds": "0.088000", "total_seconds": "1.550000", "pt_bits_per_key": "3.642861", "mapper_bits_per_key": "0.735875", "bits_per_key": "4.378736", "nanosec_per_key": "17.538620"}
{"n": "10000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.937000", "searching_seconds": "1.946000", "encoding_seconds": "0.011000", "total_seconds": "2.894000", "pt_bits_per_key": "2.252707", "mapper_bits_per_key": "0.092091", "bits_per_key": "2.344798", "nanosec_per_key": "27.216862"}
{"n": "10000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "0.913000", "searching_seconds": "1.006000", "encoding_seconds": "0.054000", "total_seconds": "1.973000", "pt_bits_per_key": "2.921088", "mapper_bits_per_key": "0.416306", "bits_per_key": "3.337394", "nanosec_per_key": "15.361106"}
In summary: the metric of interest here is `nanosec_per_key`, i.e., the average lookup time in nanoseconds per key. It does not change much except for Elias-Fano, as I expected, because its code is the one relying on PDEP for faster selection in a 64-bit word. With the compiler options, Elias-Fano is almost 2X faster. So I would expect to see a similar effect for SSHash as well, because it uses Elias-Fano in a couple of places (but hopefully not a ~2X slowdown...).
These are the results for 100M keys:
# With -march=native -mbmi2 -msse4.2
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.179000", "searching_seconds": "21.196000", "encoding_seconds": "0.053000", "total_seconds": "31.428000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "27.633028"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.176000", "searching_seconds": "6.951000", "encoding_seconds": "0.939000", "total_seconds": "18.066000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "44.560108"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.213000", "searching_seconds": "35.437000", "encoding_seconds": "0.106000", "total_seconds": "45.756000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "45.438587"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.192000", "searching_seconds": "15.508000", "encoding_seconds": "0.610000", "total_seconds": "26.310000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.359295"}
# Without -march=native -mbmi2 -msse4.2
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.695000", "searching_seconds": "21.136000", "encoding_seconds": "0.066000", "total_seconds": "30.897000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.539648"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.684000", "searching_seconds": "6.855000", "encoding_seconds": "0.963000", "total_seconds": "17.502000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "51.200598"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.746000", "searching_seconds": "35.093000", "encoding_seconds": "0.130000", "total_seconds": "44.969000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "79.023305"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.721000", "searching_seconds": "15.411000", "encoding_seconds": "0.627000", "total_seconds": "25.759000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "40.438007"}
and are consistent with the others reported before. Elias-Fano went from ~45 ns/key with the flags to ~80 ns/key without.
Thanks @jermp!
I wonder if there is a `simde`-like library for the `pdep` instruction. That is, would it be possible to compile without the flag, but have a library that provides runtime dispatch, using this instruction if available and a fallback otherwise?
I also made the following experiment:
# With -mbmi2 -msse4.2 but without -march=native
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.772000", "searching_seconds": "20.985000", "encoding_seconds": "0.063000", "total_seconds": "30.820000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.117325"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.787000", "searching_seconds": "6.809000", "encoding_seconds": "0.949000", "total_seconds": "17.545000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "44.683312"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.833000", "searching_seconds": "34.743000", "encoding_seconds": "0.107000", "total_seconds": "44.683000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "46.548534"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "9.770000", "searching_seconds": "14.854000", "encoding_seconds": "0.620000", "total_seconds": "25.244000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.459911"}
# Without -mbmi2 -msse4.2 but with -march=native
{"n": "100000000", "c": "7.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "compact-compact", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.181000", "searching_seconds": "21.743000", "encoding_seconds": "0.053000", "total_seconds": "31.977000", "pt_bits_per_key": "3.081809", "mapper_bits_per_key": "0.092021", "bits_per_key": "3.173830", "nanosec_per_key": "28.107325"}
{"n": "100000000", "c": "11.000000", "alpha": "0.880000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.195000", "searching_seconds": "7.072000", "encoding_seconds": "0.940000", "total_seconds": "18.207000", "pt_bits_per_key": "3.311385", "mapper_bits_per_key": "0.735804", "bits_per_key": "4.047189", "nanosec_per_key": "45.230406"}
{"n": "100000000", "c": "6.000000", "alpha": "0.990000", "minimal": "true", "encoder_type": "elias_fano", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.241000", "searching_seconds": "35.922000", "encoding_seconds": "0.106000", "total_seconds": "46.269000", "pt_bits_per_key": "2.164527", "mapper_bits_per_key": "0.092021", "bits_per_key": "2.256549", "nanosec_per_key": "45.972021"}
{"n": "100000000", "c": "7.000000", "alpha": "0.940000", "minimal": "true", "encoder_type": "dictionary-dictionary", "num_partitions": "1", "seed": "1234567890", "num_threads": "1", "external_memory": "false", "partitioning_seconds": "0.000000", "mapping_ordering_seconds": "10.204000", "searching_seconds": "15.220000", "encoding_seconds": "0.611000", "total_seconds": "26.035000", "pt_bits_per_key": "2.818595", "mapper_bits_per_key": "0.416232", "bits_per_key": "3.234826", "nanosec_per_key": "36.728889"}
from which we get the very same performance. This is consistent with the `gcc -march=native` output shown earlier, which also includes `-mbmi2`.
> That is, would it be possible to compile without the flag, but to have a library that provides runtime dispatch using this instruction if available but not otherwise?
Would it be possible to detect it via CMake?
Yes. One can have CMake compile arbitrary code and see if it runs; there may even be a CMake module to check for this. For bioconda, though, the fear is that the host used to compile has the instruction but the runtime machine doesn't. In that case, one can compile both paths and dispatch to the correct one at runtime. I do this with ksw2 in salmon.
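For the compile-and-run check, CMake does ship a module for this; a sketch (the result variable name `SSHASH_HAVE_BMI2` is hypothetical):

```cmake
include(CheckCXXSourceRuns)
set(CMAKE_REQUIRED_FLAGS "-mbmi2")
check_cxx_source_runs("
  #include <immintrin.h>
  int main() { return _pdep_u64(1ULL, 1ULL) == 1ULL ? 0 : 1; }
" SSHASH_HAVE_BMI2)
# Caveat from above: this only proves the *build* host supports BMI2,
# not the machine a bioconda package eventually runs on.
if(SSHASH_HAVE_BMI2)
  add_compile_options(-mbmi2)
endif()
```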
Continuing from #22
The discussion here is in regard to point 2. I have two thoughts here.
1) I absolutely agree that, until native M1/M2 builds are available from bioconda, it would be better for folks to compile it themselves, and for that to be made as easy as possible (could we provide pre-compiled binaries?).
2) Regarding performance: actually, Rosetta 2 is pretty amazing in my experience. Even through translation, the M1 (Pro/Max) often outperforms the previous top-end MacBook Pros running an i9. My understanding is that Rosetta 2 directly translates many of the x86 intrinsics to native NEON intrinsics (or whatever special instructions the M architecture has). While I agree that compilation isn't difficult, I also have a lot of prior experience telling me that my saying so is very different from the experience a biologist who doesn't focus on software/methods will have trying to build my tool.
Of course, I absolutely understand if you think supporting bioconda builds that run on M1/M2 isn't of sufficient priority to warrant effort at this point; we could ask the bioconda people what their path forward and intended timeline is. On the other hand, it would be nice to know the delta between what `-march=native` offers and what instructions are actually useful or necessary. It may be that we can remove that flag in conda builds, explicitly specify the instructions we want, and get little-to-no performance degradation plus the ability to distribute something via conda that works on all platforms (which makes it trivial for people to use both locally and on a cluster).