samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

snprintf instead of unsafe sprintf #1586

Closed oleksii-nikolaienko closed 1 year ago

oleksii-nikolaienko commented 1 year ago

HTSLib is a dependency for >20 different R packages including the core ones of Bioconductor. Next R release (v4.3) will issue a warning upon use of sprintf in C/C++ code, which is then treated as an error in CRAN's R CMD check (routine check for all R packages). Besides, this warning is simply annoying and indeed flags a security risk. Suggested solution for this is to substitute all [v]sprintfs with [v]snprintfs. This probably should be done in HTSlib as well as in samtools code. For an experienced C/C++ (HTSlib) developer it is very easy, so I'm not sure it is a good idea to open a PR myself. But please let me know if it is of interest. Thanks for looking into this!

jkbonfield commented 1 year ago

Such warnings are usually bogus and counter-productive.

Htslib and samtools are subject to very rigorous testing, including continuous fuzz testing to look for buffer overflows. We have received reports before of "bugs" due to use of sprintf that are nothing more than false positives by over-zealous security scanning tools.

There are numerous places to change, and to date we haven't deemed it necessary to fix non-security bugs just for the sake of silencing some (IMO broken) third-party tools.

Eg this code: https://github.com/samtools/htslib/blob/develop/sam.c#L5393-L5395

                        char qual[20];
                        if (mod[j].qual >= 0)
                            sprintf(qual, "%d", mod[j].qual);

Yes we could change it to snprintf instead, but should we blindly change all such uses? What other impact will it have? Speed maybe due to additional checks? (Although for performant things we usually use kstring instead.) It's clearly wrong to label this a security issue.

(Edit: accepted though if the world ever switches to a CPU where "int" is 64-bit by default, then such code would need fixing. I think this is a long way off, if ever, and there will be many many greater problems to fix first.)

oleksii-nikolaienko commented 1 year ago

I'm afraid that not only R's behavior can be an issue. The deprecation of sprintf in OSX already affected many projects. Some large ones (OpenJDK, Boost libraries) have already replaced sprintfs with alternatives. Maybe it is unavoidable for HTSlib as well.

I have no knowledge and cannot comment on speed or other potential impacts or security issues.

jkbonfield commented 1 year ago

Ugh, that's "courageous" of them! While they may wish to deprecate it, in doing so they are departing from the C standards which have not deprecated these functions. I assume they also include strcpy too?

My concern is blindly changing one function for another function doesn't in itself lead to more secure code. While it's safer in general, it's not foolproof and it still punts the onus on setting the correct length to the programmer. If you trust the programmer, you maybe can also trust they know how to allocate memory correctly for sprintf too. Additionally every code change comes with risks of new bugs introduced, so change for the sake of change potentially has a higher security risk than doing nothing.

That said, if this is just the norm now, then grrr! Maybe we will indeed have to do something (even if it's just turning off warnings on MacOS!). Sigh

jkbonfield commented 1 year ago

Out of interest, how are you compiling htslib? What compiler for example?

We're doing CI on ARM Ventura macos and it's building without warnings. I'm wondering if this warning only exists when using a C++ compiler or similar.

See https://cirrus-ci.com/task/5540763010859008?logs=compile#L95 for our logs. I've no idea how we tell which version of xcode is the default on Ventura.

Edit: even explicitly asking for xcode 14 doesn't trigger these: https://cirrus-ci.com/task/5232657760518144 I suspect this is an error caused by attempting to build C code with a C++ compiler, but please clarify with some build logs for us to look at.

oleksii-nikolaienko commented 1 year ago

In R, I am linking to its prepackaged version, Rhtslib. Then my (and many other packages) can be compiled upon installation on user's machines (any of Win/Linux/OSX) using default compiler toolchain without manually downloading and installing HTSlib.

The warnings I complain about are R warnings during routine check at CRAN's and Bioconductor's CI. Even though I use recent Mac (and Linux), I don't see any compiler warnings myself (yet). However, since other projects are substituting sprintfs with snprintfs (due to Apple's and/or R's policies), and those CI warnings are annoying, I thought it might be a good idea to do the same with remaining sprintfs in HTSlib as well.

Here's how R CMD check (not a compiler) warnings look for Rhtslib itself and for one of the other core Bioconductor packages, VariantAnnotation. Again, nothing major, but rather a sign of future troubles. And while Bioconductor still permits them, it is already not allowed on CRAN.

jkbonfield commented 1 year ago

Ah, I see this:

* checking compiled code ... WARNING
Note: information on .o files is not available
File ‘/home/biocbuild/bbs-3.17-bioc/R/site-library/Rhtslib/libs/Rhtslib.so’:
  Found ‘__sprintf_chk’, possibly from ‘sprintf’ (C)
  Found ‘abort’, possibly from ‘abort’ (C)
  Found ‘exit’, possibly from ‘exit’ (C)
  Found ‘stderr’, possibly from ‘stderr’ (C)
  Found ‘stdout’, possibly from ‘stdout’ (C)

So basically it's doing a symbol table scan on the object files, and this has nothing specifically to do with MacOS deprecation by the looks of it. I think that's maybe a red herring here.

Removing some of the exit/abort calls is on our todo list, but mostly they're deprecated APIs which we cannot change (as they don't have any way to return error codes and we have newer interfaces which people should use instead), or they're cases which are programming checks rather than user-input checks, so nothing the user can do could trigger these unless we've code a huge coding error, in which case all bets are off at that point.

Stdout/stderr will be diagnostics, some of it is logging which can be increased with verbosity, but some is indeed problematic as we ought to have a better way of returning errors as integers with a hts_strerror type interface to get string forms. That mostly hasn't been written yet. However it still won't silence R warnings here as we're always going to have the debug symbols there for when people want to diagnose problems.

Silencing just the sprintf cases doesn't seem like a priority in the grand scheme of things, given all the other warnings will still be there and the R build logs will continue to flag as "WARNINGS". It's unrealistic to change everything that triggers R complaints, and doing so would require a new incompatible ABI and numerous updates to other tools.

oleksii-nikolaienko commented 1 year ago

Just a note: it is only the [v]sprintf that triggers these warnings. Otherwise it'd be a note. Confirmation is in the R source code (src/library/tools/R/check.R), but I have no URL to link to... That's why I thought it might be an acceptable fix for HTSlib and easy solution for me (and other R package developers)

daviesrob commented 1 year ago

It looks like there's only 10 instances of sprintf() in the HTSlib source. It wouldn't be too difficult to get rid of them, if only to make life quieter.

jkbonfield commented 1 year ago

Do you have a list of functions that gets scanned for? Do we need to change all strcpy to strncpy too for example (equally capable of having buffer overflows as sprintf)?

jkbonfield commented 1 year ago

It looks like there's only 10 instances of sprintf() in the HTSlib source. It wouldn't be too difficult to get rid of them, if only to make life quieter.

I'd already done this search and found 34, although 14 are in test.

oleksii-nikolaienko commented 1 year ago

R v4.2 did not warn on use of stdout/stderr/assert/abort/exit (for example: v4.2 log vs v4.3 log). It is only the [v]sprintf which became a problem in v4.3

Edit: I guess the reason is to prepare for eventual compilations errors on OS X due to use of "unsafe" (in Apple's opinion) code

daviesrob commented 1 year ago

You should know that strncpy() is not a suitable replacement for strcpy()...

For another data point, curl bans sprintf, vsprintf, strcat, strncat (plus a few variants) and gets. (It also bans gmtime and localtime due to lack of thread-safety).

jkbonfield commented 1 year ago

I know strncpy needs extra work compared to strcpy due to the null-termination status, but it absolutely can be used as a replacement with care. Why wouldn't it? Maybe people still deem it to be unsafe though so prefer something else? I can't second guess the reasoning, which is why I asked.

But if it's just sprintf and vsprintf then it's doable. I still argue though that it's pointless effort! A tick box exercise for people who like to be seen to do something. There are millions of way code can be broken. sprintf can trivially be secure (unlike gets which can never). But I guess I'm just swimming against the tide on this one.

Edit: on the bright side, it looks like snprintf only adds 2-3% extra CPU overhead, so it's not a concern for heavily used things like kputd (and the additional s->m - s->l isn't likely to be an issue either).

jmarshall commented 1 year ago

Edit: I guess the reason is to prepare for eventual compilations errors on OS X due to use of "unsafe" (in Apple's opinion) code

Do you have a link or other citation to Apple documentation describing such a policy?

oleksii-nikolaienko commented 1 year ago

Can't find any proper description or discussion from Apple (just this), only changes in open source projects. There are threads with complaints that Xcode 14 gives an error for OSX developers, Google finds them too. Issue to give the same error for sprintf in plain C code in addition to C++. Chromium, libtiff, CMake affected.

I guess, when Apple changes something they don't really discuss it with the community. People then just have to decide if they want to drop OSX support or adjust accordingly...

oleksii-nikolaienko commented 1 year ago

Thanks for accepting this issue. I thought of a couple of things and please forgive me if I'm saying something very obvious to you:

  1. I guess this substitution should also be performed in samtools as well. But there's no need to create another issue, or is there?
  2. It would probably make sense to disallow future use of sprintf so then they don't sneak back in. Of curiosity, I tried to find a solution and ended up with two compiler-dependent pragmas (below). Is there another, better way to deprecate a standard function?
#pragma GCC poison sprintf vsprintf
#pragma clang deprecated sprintf vsprintf

Would C++14 attributes work instead?

[[deprecated("please use snprintf instead")]]
int sprintf( char* buffer, const char* format, ... );

[[deprecated("please use vsnprintf instead")]]
int vsprintf( char* buffer, const char* format, va_list vlist );
jkbonfield commented 1 year ago

We can't use C++14 in our C code. However we already have HTS_DEPRECATED in hts_defs.h as a compiler-specific definition.

If this is needed in samtools and/or bcftools they ought to have their own issues too, as we'd probably not do them all in a single step and at the very least it'll obviously be different commits. I don't see the need for samtools/bcftools though as they're not a library and can't be wrapped up by R / Bioconductor can they?

oleksii-nikolaienko commented 1 year ago

Not sure why, but R package contains some *.c and *.h files from samtools as well. Probably their functionality is needed for some core Bioconductor packages.

jmarshall commented 1 year ago

The R package's samtools code is there because they included a few functions they considered generally useful from the samtools code (rather than raising HTSlib issues to discuss adding such functions or equivalents to the HTSlib library) and they resurrected the legacy Samtools API, both presumably for the benefit of other Bioconductor packages still using such legacy APIs from the Rsamtools days.

This is a copy of (a subset of) the samtools code within the Rhtslib repository. So the R developers would be free to tweak their local copy to avoid sprintf; there's no implied need for upstream samtools to do so too.

jmarshall commented 1 year ago

Several of the sprintf invocations in HTSlib are doing file path manipulations.

Instead of snprintf, they could be replaced with kstrings[^1] or other malloced strings and eventually get rid of PATH_MAX from the HTSlib source too. This would make one of Debian's patches obsolete and thus improve portability — Hurd doesn't have PATH_MAX.

[^1]: I have a draft of some of that, but it's pretty trivial stuff anyway.

oleksii-nikolaienko commented 1 year ago

I see. Thanks for the explanation

daviesrob commented 1 year ago

1594 gets rid of the sprintf() calls. Dealing with PATH_MAX looks a bit harder so is probably best done separately.

I haven't attempted to ban it in the code yet. Both pragmas and deprecations may be a bit messy because they have to be applied at exactly the right time, between including the system headers and any HTSlib ones. For now we can rely on maintainer vigilance, but it could be possible to scan for it using nm in our CI tests.

oleksii-nikolaienko commented 1 year ago

Thanks very much, guys. And for such a great library too

jmarshall commented 1 year ago

For the record, here is what the current (13.3, 22E245) macOS SDK's <stdio.h> has to say about sprintf:

__swift_unavailable("Use snprintf instead.")
#if !defined(_POSIX_C_SOURCE)
__deprecated_msg("This function is provided for compatibility reasons only.  Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.")
#endif
int  sprintf(char * __restrict, const char * __restrict, ...) __printflike(2, 3);

// …

#if defined (__GNUC__) && _FORTIFY_SOURCE > 0 && !defined (__cplusplus)
/* Security checking functions.  */
#include <secure/_stdio.h>
#endif

secure/_stdio.h #defines sprintf to a checking version with a different name, which sidesteps the earlier declaration that has the deprecation message. Hence for C (as opposed to C++) code, the deprecation warning will only appear if both of the following are true:

  1. You are building plain macOS code, as opposed to POSIX-using code that would cause _POSIX_C_SOURCE to be defined.
  2. You have overridden the default value (2 on macOS) of _FORTIFY_SOURCE and set it to 0. (Mostly no-one does this, so this means that just about nobody should see this deprecation message. However using AddressSanitizer happens to incidentally set it 0.)

(See libtiff's useful summary and links for details.)

For HTSlib, this means that builds would normally never see this deprecation. AddressSanitizer builds, which are inherently non-production builds, would at present get the deprecation warning.

At present, HTSlib does not define _POSIX_C_SOURCE or _XOPEN_SOURCE on macOS. However it does use POSIX and XSI functionality, so there is a point of view that this should be tidied up and config.h should declare these feature macros in more cases. If that happened, all HTSlib builds would never see this deprecation warning.

Hence, as long as Apple has that _POSIX_C_SOURCE escape hatch in , future compilation on macOS is not a reason to avoid sprintf.

jkbonfield commented 1 year ago

Thanks for the research.

Personally, I think the security issues fixed in snprintf don't buy that much over peer-reviwed code using sprintf, and why focus on that one function and not a myriad of others like strcpy, strcat, etc? There's a whole raft of potential problems. However it's done now so shrug.

It's a good point about those macros though. We probably ought to be defining _POSIX_C_SOURCE or maybe _XOPEN_SOURCE (in the Makefile as a hint, but commented out) given we clearly do have these as a requirement. But I expect it's a cross this bridge when we come to it sort of thing.